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ABSTRACT 


The  annealing  algorithm  is  a  popular  Monte-Carlo  algorithm  for  combina¬ 
torial  optimization.  The  annealing  algorithm  consists  of  simulating  a  nonsta¬ 
tionary  finite  state  Markov  chain  whose  state  space  is  the  domain  of  the  cost 
function,  called  energy,  to  be  minimized.  The  degree  of  randomization  in  the 
annealing  algorithm  is  controlled  by  a  parameter,  called  temperature,  which  is 
slowly  decreased  to  zero.  The  convergence  in  probability  and  the  rate  of  con¬ 
vergence  of  the  annealing  chain  for  the  special  case  of  an  energy  function 
with  two  local  minima  is  analyzed.  The  sample  path  properties  of  annealing 
chains  (with  arbitrary  energy  functions)  are  examined.  A  modification  of  the 
annealing  algorithm  which  makes  noisy  measurements  of  the  energy  function 
is  given.  The  annealing  algorithm  is  extended  for  optimization  on  general 
spaces. 

Tv 

The  Langevin  algorithm  is  a  popular  Monte-Carlo  algorithm  for  multivariate 
optimization.  The  Langevin  algorithm  consists  of  simulating  a  nonstationary 
diffusion  process.  The  relationship  between  the  annealing  and  Langevin  algo¬ 
rithms  is  studied.  It  is  shown  that  an  annealing  chain  driven  by  white  Gaus¬ 
sian  noise  and  interpolated  into  a  piecewise  constant  process  converges 
weakly  to  a  time-scaled  Langevin  diffusion.  Motivated  by  this  result,  a  hybrid 
annealing/Langevin  algorithm  is  proposed. 
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FREQUENTLY  USED  NOTATION 


Iff ,  the  natural  numbers 

If ,  r-dimensional  Euclidean  space 

If,  Borel  subsets  of  If 

Cr[0,T],  If -valued  continuous  functions  on  [0,T] 

Dr[0,T],  If -valued  cddlag  functions  on  [0,T] 

N(m,A)  (*)>  Normal  measure  with  mean  m  and  covariance  A 
XA(-),  indicator  function  of  the  set  A 
B(a,R),  ball  of  radius  R  centered  at  a 
|x|,  Euclidean  norm  of  x 
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(x,y),  Euclidean  inner  product  of  x,y 
x(g)y,  Euclidean  outer  product  of  x,y 
a.e.,  almost  everywhere 
w.p.l,  with  probability  one 
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i.o.„  infinately  often 


a. a.,  almost  always 
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CHAPTER  I 
INTRODUCTION 


Algorithms  for  finding  a  global  extremum  of  a  real-valued  function  may 
be  classified  into  two  groups:  deterministic  and  random.  The  distinction  here 
is  of  course  that  the  random  or  Monte-Carlo  algorithms  make  use  of  pseudo 
random  variates  whereas  the  deterministic  algorithms  do  not.  The  earliest 
global  optimization  algorithms  were  of  the  deterministic  type  and  were 
associated  with  evaluating  the  cost  function  at  points  on  a  grid.  One 
drawback  of  these  methods  is  that  they  typically  require  certain  prior 
information  about  the  cost  function  such  as  a  Lipshitz  constant.  Most  global 
optimization  algorithms  are  of  the  random  type  and  are  related  to  the  so- 
called  multistart  algorithm.  In  this  approach,  a  local  optimization  algorithm 
is  run  from  different  starting  points  which  are  selected  at  random,  usually 
from  a  uniform  distribution  on  the  domain  of  the  cost  function.  See  [5],  [29] 
for  a  discussion  of  global  optimization  algorithms. 

Recently,  motivated  by  hard  combinatorial  optimization  problems  such 
as  arise  in  computer  design  and  operations  research,  Kirkpatrick  et.  al.  [19] 
and  independently  Cerny  [3]  have  proposed  a  different  kind  of  random 
algorithm  called  simulated  annealing.  The  annealing  algorithm  is  based  on  an 
analogy  between  large  scale  optimization  problems  and  statistical  mechanics. 
For  our  purposes  this  analogy  consists  simply  of  viewing  the  cost  function  as 
an  energy  function  defined  on  a  finite  state  space  of  an  imaginary  physical 
system.  The  annealing  algorithm  is  then  seen  as  a  variation  on  a  Monte- 
Carlo  algorithm  developed  by  Metropolis  et.  al.  [25]  for  making  statistical 
mechanics  calculations,  which  we  now  describe.  It  is  well-known  that  the 
states  of  a  physical  system  in  thermal  equilibrium  obey  a  Gibbs  distribution 
oc  exp[— U(’)/T],  where  U(*)  is  an  energy  function  and  T  is  the  temperature. 
The  Metropolis  algorithm  was  developed  for  obtaining  samples  from  such  a 
Gibbs  distribution  and  for  computing  estimates  of  functionals  averaged  over 
the  Gibbs  distribution.  The  Metropolis  algorithm  proceeds  as  follows: 

Given  a  state  i  of  the  system,  select  a  candidate  state  j  in  a  random 

manner  corresponding  to  a  small  perturbation  of  the  system,  and 


compute  the  change  in  energy  AU  =  U(j)  —  U(i).  If  AU  <  0  accept 
state  j  as  the  new  state  for  the  next  iteration  of  the  algorithm.  If 
AU  >  0  accept  state  j  with  probability  exp[—  AU/T];  otherwise  the 
algorithm  starts  at  state  i  for  the  next  iteration. 

The  annealing  algorithm  consists  of  identifying  the  cost  function  to  be 
minimized  with  the  energy  function  U(*)  and  taking  the  temperature  T  as  a 
function  of  time  and  slowly  lowering  it  to  zero.  Suppose  that  the  distribution 
of  a  candidate  state  is  independent  of  past  states  given  the  current  state. 
Then  it  is  clear  that  the  Metropolis  algorithm  simulates  the  sample  paths  of  a 
Markov  chain,  and  it  can  be  shown  that  if  the  candidate  states  are  selected 
in  a  suitable  manner  then  this  chain  infact  has  a  Gibbs  distribution 
oc  exp[—  U(i)/T]  as  its  (unique)  equilibrium  distribution  (see  Chapter  2  for 
details).  Furthermore  as  the  temperature  T  is  decreased  to  zero  the  Gibbs 
distribution  concentrates  more  and  more  on  the  lower  energy  states.  The 
motivation  behind  the  annealing  algorithm  is  that  if  T — >0  slowly  enough  such 
that  the  system  is  never  far  away  from  equilibrium,  then  presumably  there  is 
convergence  (in  some  probabilistic  sense)  to  the  global  minima  of  U('). 

The  annealing  algorithm  stands  in  contrast  to  heuristic  methods  for 
combinatorial  optimization  which  are  based  on  iterative  improvement, 
allowing  only  decreases  in  the  cost  function  at  each  iteration.  Iterative 
improvement  algorithms  in  statistical  mechanics  terms  correspond  to  rapidly 
quenching  a  system  from  a  high  to  a  very  low  temperature.  Such  quenching 
can  result  in  the  system  getting  trapped  in  a  so-called  metastable  state,  and 
analogously  the  iterative  improvement  algorithm  getting  trapped  in  a  strictly 
local  minimum  of  the  cost  function.  On  the  other  hand,  the  annealing 
algorithm  corresponds  to  slowly  cooling  a  system.  Such  cooling  should  result 
in  the  system  spending  most  of  its  time  among  low  energy  states  and 
analogously  the  annealing  algorithm  finding  a  global  or  nearly  global 
minimum  of  the  cost  function. 

The  annealing  algorithm  as  described  above  is  suitable  for  combinatorial 
optimization.  Motivated  by  optimization  problems  with  continuous  variables 
which  arise  in  image  processing  problems,  Geman  and  independently 
Grenander  [13]  have  proposed  a  diffusion-type  algorithm  called  the  Langevin 
algorithm  (as  coined  by  Gidas  [l  1  ]).  Consider  the  diffusion  solution  of  the 
Langevin  equation 

dx(t)  =  —  VU(x(t))dt  +  V/2T  dw(t) 

where  U(*)  is  now  a  smooth  function  on  r-dimensional  Euclidean  space  (again 
called  energy),  T  is  a  positive  constant  (again  called  temperature),  and  w(')  is 
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a  standard  r-dimensional  Wiener  process.  The  Langevin  equation  describes 
the  motion  of  a  particle  in  a  viscous  fluid.  The  Langevin  algorithm  consists  of 
identifying  the  cost  function  to  be  minimized  with  the  energy  function  U(*) 
and  taking  the  temperature  T  as  a  function  of  time  and  slowly  lowering  it  to 
zero.  Now  it  is  well  known  that  under  suitable  conditions  on  U(*)  the 
diffusion  solution  of  the  Langevin  equation  has  a  Gibbs  density 
oc  exp[—  U(*)/T]  as  its  (unique)  equilibrium  density,  and  as  the  temperature  T 
is  decreased  to  zero  this  density  becomes  more  and  more  concentrated  on  the 
lower  energy  states.  Like  the  annealing  algorithm,  the  motivation  behind  the 
Langevin  algorithm  is  that  if  T— *0  slowly  enough  such  that  the  system  is 
never  far  away  from  equilibrium,  then  presumably  there  is  convergence  (in 
some  probabilistic  sense)  to  the  global  minima  of  U(’). 

The  annealing  algorithm  has  been  applied  with  varying  success  to  a  wide 
range  of  problems  including  circuit  placement  and  wire  routing  for  VLSI  chip 
design  [19],  image  reconstruction  [8],  and  assorted  hard  combinatorial 
problems  which  arise  in  operations  research  [3],  [12],  [18],  [19].  There  has  also 
been  intense  theoretical  interest  in  both  the  annealing  algorithm  [8],  [10],  [ll], 
[14],  [15],  [26],  [31]  and  the  Langevin  algorithm  [4],  [9],  [11],  [15],  [21]. 

The  goal  of  this  thesis  may  simply  be  stated  as  the  analysis  of  the 
asymptotic  (large  time)  behavior  of  simulated  annealing  type  algorithms,  by 
which  we  mean  not  only  the  annealing  algorithm  but  also  the  Langevin  and 
related  algorithms.  We  are  particularly  interested  in  the  relationship  between 
the  annealing  and  Langevin  algorithms.  Here  is  a  Chapter-by-Chapter 
outline  of  the  thesis. 

In  Chapter  2  we  discuss  the  finite  state  annealing  algorithm  as  proposed 
by  Kirkpatrick  and  independently  by  Cerny.  In  2.1  we  give  a  precise 
description  of  the  annealing  chain  (the  Markov  chain  whose  sample  paths  are 
simulated  in  the  annealing  algorithm).  We  then  briefly  discuss  two  numerical 
studies  of  the  annealing  algorithm  by  Johnson  et.  al.  [18]  and  Golden  and 
Skiscim  [12],  and  next  describe  some  of  the  large  body  of  theoretical  work  on 
the  subject  with  particular  emphasis  on  the  work  of  Mitra  et.  al.  [26]  and 
Hajek  [14].  In  2.2  we  study  the  asymptotic  behavior  of  a  class  of 
nonstationary  finite  state  Markov  chains  in  preparation  for  the  analysis  of  the 
annealing  algorithm  itself.  In  2.3  we  use  the  results  of  2.2  to  analyze  the 
annealing  algorithm.  We  first  examine  in  depth  the  convergence  in 
probability  and  the  rate  of  convergence  of  the  annealing  chain  to  the  globally 
minimum  energy  state  for  an  energy  function  with  two  local  minima  (one 
strictly  local  and  one  global).  Although  cost  functions  encountered  in  large 
scale  combinational  problems  may  have  large  numbers  of  local  minima,  the 
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results  we  present  are  new  and  offer  some  interesting  insights.  We  next 
perform  a  sample  path  analysis  of  the  annealing  chain  and  obtain  conditions 
under  which  the  annealing  chain  visits  the  set  of  global  minima  of  the  energy 
function  with  probability  one,  visits  the  set  of  global  minima  with  probability 
strictly  less  than  one,  or  converges  to  the  set  of  global  minima  with 
probability  one.  These  results  are  different  than  most  of  the  analytical  results 
on  the  annealing  algorithm,  which  give  conditions  under  which  the  annealing 
chain  converges  to  the  set  of  global  minima  in  probability.  In  2.4  we  describe 
and  analyze  a  modification  of  the  annealing  algorithm  which  uses  noisy 
measurements  of  the  energy  function. 

In  Chapter  3  we  extend  the  annealing  algorithm  for  optimization  on 
general  spaces.  In  3.1  we  give  a  precise  description  of  a  general  state 
annealing  chain.  In  3.2  we  discuss  the  ergodicity  of  the  general  state 
annealing  chain  at  a  fixed  temperature,  i.e.,  we  discuss  a  general  state  version 
of  the  Metropolis  algorithm.  Here  we  settle  some  technical  issues  which  do 
not  arise  in  the  finite  state  Metropolis  algorithm.  In  3.3  we  study  the 
asymptotic  behavior  of  a  class  of  nonstationary  general  state  Markov  chains 
in  preparation  for  the  analysis  of  the  general  state  annealing  algorithm  itself. 
In  3.4  we  use  the  results  of  3.3  to  extend  the  result  of  2.3  on  the  finite  state 
annealing  chain  visiting  the  set  of  global  minima  of  the  energy  function  with 
probability  one  to  the  general  state  case,  essentially  under  the  conditions  that 
the  state  space  be  a  compact  metric  space  and  the  energy  function  be 
continuous.  It  is  not  known  whether  convergence  to  the  set  of  global  minima 
in  probability  can  be  obtained  under  such  weak  conditions. 

In  Chapter  4  we  discuss  the  Langevin  algorithm  as  proposed  by  Geman 
and  independently  by  Grenander.  In  4.1  we  give  a  precise  description  of  the 
Langevin  algorithm  and  summarize  the  convergence  results  of  Geman  and 
Hwang  [9],  Gidas  [11],  and  Kushner  [21],  In  4.2,  4.3  we  present  what  we 
believe  to  be  the  most  interesting  results  of  the  thesis.  In  4.2  we  show  that  an 
annealing  chain  of  the  type  considered  in  Chapter  3  with  r-dimensional 
Euclidean  state  space  and  driven  by  white  Gaussian  noise  converges  in  a 
certain  sense  to  a  Langevin  diffusion.  In  4.3  we  propose  a  hybrid 
annealing/Langevin  algorithm  based  on  the  results  of  4.2.  We  argue  that  the 
hybrid  algorithm  enjoys  the  advantages  of  both  the  annealing  and  Langevin 
algorithms.  Unfortunately,  we  have  not  yet  succeeded  in  establishing  the 
convergence  of  the  hybrid  algorithm  and  this  is  left  as  a  future  task. 

In  Chapter  5  we  collect  the  results  of  the  thesis  and  make  some 
concluding  remarks. 


13 


CHAPTER  H 

FINITE  STATE  ANNEALING  TYPE  ALGORITHMS 

2.1  Introduction  to  the  Annealing  Algorithm 

In  Chapter  1  we  briefly  described  the  annealing  algorithm  and  discussed 
the  heuristic  motivation  based  on  the  connection  that  Kirkpatrick  [19]  has 
suggested  between  statistical  mechanics  and  large-scale  optimization 
problems.  Mathematically,  the  annealing  algorithm  consists  of  simulating  a 
nonstationary  finite-state  Markov  chain  whose  state  space  is  the  domain  of 
the  cost  function  (called  energy)  to  be  minimized.  In  this  Section  we  shall 
discuss  in  detail  the  annealing  algorithm  and  describe  some  of  the 
considerable  literature  which  has  been  devoted  to  its  analysis. 

We  first  give  some  standard  finite  state  space  Markov  chain  notation  (c.f. 
[6],  [7]).  Let  £  be  a  finite  set.  P  =  [PijjjjeE  is  a  stochastic  matrix  on  £  if 
Pij  >  0  for  all  i,j€£  and 

E  Pij  =  1  v  ie£ . 

j€E 

{p(k,k+i)}  =  {[Pijk,lc+1^]}  are  the  1-step  transition  matrices  for  a  Markov  chain 
{£k}  with  state  space  £  if  for  every  kE®  p(k’k+1)  is  a  stochastic  matrix  on  £ 
and 

P«k+l  -  jfck  -  i>  -  pf-k+,)  (if  P{fk  =  i>  >  0)  (2.1) 

for  all  i,je£.  Conversely,  given  a  sequence  {p('c-k+1)}  =  {[p/jk,k+dl]}  of 
stochastic  matrices  on  £  we  can  construct  on  a  suitable  probability  space 
(A,F,P)  a  Markov  chain  with  state  space  £  which  satisfies  (2.1).  For 

each  dE®let 

\ 

p(k,k+d)  _  p(k,k+l) . p(k+d-l,k+d) 

p(k,k+d)  _  [p.(k»k+d)j  jg  a  stochastic  matrix  on  £  and 

P«k+d  =  ilfk  -  i}  =  Pi(J1'•1‘+<i,  (if  P{fk  =  i>  >  0) 

for  all  i,jE£-  It  will  be  convenient  to  have  a  fixed  version  of  the  conditional 
probability  of  £k+d  given  which  we  define  by 


iL'al/lf.'iL  iL'aft.  ii’iL'iL 


L,iL'i»4i».iVA 


mrmr  .t 


P{«k+d6A|fk  =  '■>  -  £  PiSk'k+d) 

J6A 


for  all  i£E  and  ACE. 

We  now  define  the  annealing  algorithm.  Let  U(*)  be  a  nonnegative 
function  on  E,  called  the  energy  function.  The  goal  is  to  find  a  point  in  E 
which  minimizes  or  nearly  minimizes  U(*).  Let  {Tk}  be  a  sequence  of  positive 
numbers,  called  the  temperature  schedule.  Let  Q  =  [qj  be  a  stochastic  matrix 
on  E.  Now  let  {£k}  be  the  Markov  chain  with  state  space  E  and  1-step 
transition  matrices  |p(k-lc+1)}  =  {(p/jk’k+1^]}  given  by 


Qij  exp 


if  U(j)  >  U(i) 


Ck.k+1)  = 

Plj 


% 

1  -  E  Pi?’k+1) 


if  U(j)  <  U(i)  ,  j*i  (2.2) 


if  j  -i 


for  all  i,j£E.  {£k}  shall  be  called  the  annealing  chain.  For  each  d£N  let 
Qd  =  [q^].  Recall  that  Q  is  irreducible  if  for  every  i,j£E  there  exists  a  d£N 
such  that  q^i  >  0.  Also,  Q  is  symmetric  if  q^  =  q^  for  all  i,j£E.  In  the 
special  case  where  Q  is  irreducible  and  symmetric  and  Tk  =  T,  a  positive 
constant,  {fk}  is  the  stationary  Markov  chain  introduced  by  Metropolis  et.  al. 
[25]  for  computing  statistics  of  a  physical  system  in  thermal  equilibrium  at 
temperature  T.  It  was  Kirkpatrick  et.  al.  [19]  and  Cerny  [3]  who  suggested 
that  the  Metropolis  scheme  could  be  used  for  minimizing  U(*)  by  letting 
T  =  Tk  — ►  0.  We  shall  call  the  algorithm  which  simulates  the  sample  paths 
of  {£k}  with  Tk — >0  the  annealing  algorithm. 

The  heuristic  motivation  behind  the  annealing  algorithm  was  discussed 
(briefly)  in  Chapter  1.  Here  we  give  the  motivation  in  more  mathematical 
terms.  Suppose  that  Q  is  irreducible  and  symmetric,  and  let  be  the 

stationary  chain  with  1-step  (stationary)  transition  matrix  PT  =  [pj]  given  by 
the  r.h.s  of  (2.2)  with  Tk  =  T,  a  positive  constant.  Then  it  can  be  shown  that 
PT  has  an  invariant  Gibbs  vector  nT  =  [7riTj  (a  row  vector),  i.e., 


nT  =  IlTPn 


where 


■"  ^  v  A  «  .  s  «■  r,  ^  »•,  •  v*  • 1  O*  ■/**  cV"  s'  \  vO  vVS’a'vs  v\* 


srvn 


«W 


I 


e: 


E  «p  [-  u(j)/x] 

j€£ 


This  follows  from  the  detailed  reversibility 

^TPiT  =  ^PjT  V  i,j€S  . 

Furthermore,  Q  irreducible  and  symmetric  implies  that  is  an  irreduciblef 
(and  aperiodic)  chain  and  by  the  Markov  Convergence  Theorem  [6,  p.  177 ) 

lim  ?{£]?■  =  i}  =  7TiT  V  i(E£  .  (2.3) 

k— ►oo 


Let  S  be  the  set  of  global  minima  of  U(*),  i.e. 

S  =  {iGS:  U(i)  <U(j)  VjGE}  . 


lim  ^  =  TTj 

T— *0 


ViGE 


where  II  =  ]  is  a  probability  vector  with  support  in  S.  In  view  of  (2.3) 

and  (2.4)  the  idea  behind  the  annealing  algorithm  is  that  by  choosing 
T  =  Tk— *0  slowly  enough  hopefully 


and  then  perhaps 


Ptfk  -  i}  ~  "i  k  (k  large) 


lim  P{£k  =  ’»}  =  V  iGE 

k— »oo 


and  consequently  £k  converges  in  probability  to  S. 

In  Chapter  1  we  roughly  described  the  procedure  by  which  the  sample 
paths  of  the  annealing  chain  are  simulated.  It  is  seen  that  the  Q  matrix 
governs  the  small  perturbations  in  the  system  configurations  which  are  then 
accepted  or  rejected  probabilistically  depending  on  the  corresponding  energy 
changes  and  the  temperature.  More  precisely,  the  annealing  chain  may  be 
simulated  as  follows.  Suppose  £k  =  i.  Then  generate  a  E-valued  random 
variable  rj  with  P{r?  =  j}  =  qy.  Suppose  rj  =  j.  Then  set 


fA  stationary  chain  is  irreducible  if  its  1-step  (stationary)  transition  matrix  is 

irreducible. 


.v .v  %  -.  %  ■,  \  >.  •  •  -»  •.  -.  %  %  - .  ■ s.  \  %  v  . 

r-'Z-  ' s  . v-’-.'S- . ''-/-.  v'  -.- ./a-.'  c  v s' -.•  % 


.%  .  \  .*. 


>  ,\ 


£k+l  — 


if  U(j)  <  U(i) 

if  U(j)  >  U(i)  with  probability  exp 


uwnwi 


J 
J 

i  else 


u(j)  -  U(i) 

Tk 


There  are  two  in  depth  numerical  studies  of  simulated  annealing  of  which 
we  are  aware.  Johnson  et.  al.  [18]  applied  the  annealing  algorithm  to  four 
well-studied  problems  in  combinational  optimization:  graph  partitioning, 
number  partitioning,  graph  coloring,  and  the  travelling  salesman  problem. 
They  compare  the  annealing  algorithm  with  the  best  of  the  traditional 
algorithms  for  each  problem.  They  found  that  although  annealing  is  able  to 
produce  quite  good  solutions  on  three  of  the  four  problems,  only  on  one  of  the 
four  (graph  partitioning)  does  it  outperform  the  best  of  its  rivals.  Golden  and 
Skiscim  [12]  have  tested  the  annealing  algorithm  on  routing  and  location 
problems,  specifically  the  travelling  salesman  problem  and  the  p-median 
problem.  They  conclude  that  there  are  more  efficient  and  effective  heuristics 
for  these  problems. 

We  shall  now  outline  the  convergence  results  on  the  annealing  algorithm 
which  are  known  to  us.  We  refer  the  reader  to  the  specific  papers  for  full 
details. 

Geman  and  Genian  [8]  were  the  first  to  obtain  a  convergence  result  for 
the  annealing  algorithm.  The  consider  a  version  of  the  annealing  algorithm 
which  they  call  the  Gibbs  sampler.  They  show  that  for  temperature  schedules 
of  the  form 


Tk 


c 

log  k 


(k  large) 


that  if  c  is  sufficiently  large  then  (2.6)  is  obtained. 

Gidas  [10]  also  considers  the  convergence  of  the  annealing  algorithm  and 
similar  algorithms  based  on  Markov  chain  sampling  methods  related  to  the 
Metropolis  method. 

We  next  discuss  the  work  of  Mitra  et.  al.  [26].  The  idea  behind  their 
work  is  similar  to  that  of  Geman  and  Geman  and  also  Gidas  in  that  they 
show  that  for  temperature  schedules  which  vary  slowly  enough  the  annealing 
chain  reaches  “quasiequilibrium”,  i.e.,  something  like  (2.5)  holds.  In  order  to 
state  Mitra  et.  al.’s  result  we  will  need  the  following  notation.  Let 
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N(i)  =  {j€E  :  qij  >0}  V  i£E  . 

Let  SM  be  the  set  of  states  that  are  local  maxima  of  U(*),  i.e., 
SM  =  {ieE:  U(i)>U(j)  V  j€N(i)}  . 

Let 


r  =  min  max  d(i,j) 

ies\sM  jgs 

where  d(i,j)  is  the  minimum  number  of  steps  to  get  from  state  i  to  state  j. 
Finally,  let 

L  =  max  max  (U(j)  -  U(i)|  . 
i€E  j€N(i) 

Here  is  Mitra  et.  al.’s  result: 


Theorem  2.1  (Mitra  et.  al.  [26]) 
Let  Tk|0  and 


oo 

£  exP 


k  —  1 


Assume  Q  is  irreducible  and  symmetricf. 


r  L 

Tkr-1 


=  OO  . 


(2.7) 


Then 


lim  P{£k  —  i}  =  TTj  V  i€E  .  (2.8) 


Remarks 

(1)  If  Tk  =  c/log  k  then  (2.7)  holds  iff  c  >  r  L. 

(2)  An  estimate  of  the  rate  of  convergence  in  (2.8)  is  obtained  for 
annealing  schedules  of  the  form  Tk  =  c/log  k  for  c  >  r  L.  Let 

w  =  min  min  q..  , 

•€-  j€N(i)  'J 


It  is  shown  that 


Of  =  min  U(i)  —  min  U(j)  . 
i€-\S  j6S 


P{£k  —  ’»}  =  ° 


’ 

1 


as  k — *  x 


where 


(2.9) 


for  just  qj.  >  0  iff  q,,  >  0  for  all  i,j£E 


Q  = 


w 


r  L/C 


_  2. 


Since  or  and  0  are  increasing  and  decreasing  respectively  with  increasing  c,  it 
is  suggested  that  c  >  r  L  be  chosen  to  maximize  min{o;,/?}. 


We  next  discuss  the  work  of  Hajek  [14].  The  idea  behind  his  work  is 
that  for  temperature  schedules  which  vary  slowly  enough,  the  annealing  chain 
escapes  from  local  minima  of  U(*)  at  essentially  the  same  rate  as  for  a 
constant  temperature.  In  order  to  state  Hajek’s  result  we  will  need  the 
following  notation.  We  shall  say  that  given  states  i  and  j,  i  can  reach  j  if 
there  exists  a  sequence  of  states  i  =  i0,...,ip  =  j  such  that  q^^  >  0  for  all 
n  *  0,„.,p— 1;  if  U(in)  <  E  (a  nonnegative  number)  for  all  n  =  0,...,p  then  we 
shall  say  that  i  can  reach  j  at  height  E.  We  shall  say  that  the  annealing 
chain  is  strongly  irreducible  if  i  can  reach  j  for  all  i,jE£.  Clearly,  strong 
irreducibility  is  equivalent  to  Q  irreducible,  but  we  introduce  strong 
irreducibility  to  conform  with  Hajek’s  notation.  We  shall  also  say  that  the 
annealing  chain  is  weakly  reversible  if  for  every  E  >  0,  i  can  reach  j  at  energy 
E  iff  j  can  reach  i  at  energy  E,  for  all  i,jE£.  Let  Sm  be  the  states  that  are 
local  minima  of  U(*),  i.e., 

Sm*{i€£:  U(i)  <  U(j)  V  j£N(i)}  . 

For  each  i€Sm\S  let  A(i)  be  the  smallest  number  E  such  that  i  can  reach  some 
j€:£  with  U(j)  <  U(i)  at  height  U(i)  +  E.  A(i)  is  the  “depth”  of  the  local  (but 
not  global)  minimum  i.  Let 

A  =  max  A(i)  .  (2.10) 

i€Sm\S 

Here  is  Hajek’s  result: 


Theorem  2.2  (Hajek  [14])  Assume  that  the  annealing  chain  is  strongly 
irreducible  and  weakly  reversible.  Let  Tk  jO.  Then 

lim  P{£icGS}  =  1  (2.11) 

k— »3C 


iff 

Tk 


V  exp 
k-l 


x  . 


(2.12) 


Remark  If  Tk  =  c/log  k  then  (2.12)  and  hence  (2.11)  holds  iff  c  >  A*.  For 
this  reason  A  has  been  called  the  optimal  constant  and  Tk  =  A  /log  k  the 
optimal  schedule. 


We  should  also  mention  that  Tsitsiklis  [30]  has  proved  of  generalization 
of  Theorem  2.2  which  does  not  assume  weak  reversibility,  using  (and 
extending)  the  theory  of  singularly  perturbed  Markov  chains. 

In  view  of  Theorem  2.2  and  the  refinement  in  [30],  the  analysis  of  the 
convergence  in  probability  of  the  annealing  algorithm  is  essentially  complete, 
with  the  exception  that  it  does  not  appear  that  anyone  has  determined  the 
rate  of  convergence  for  optimal  or  nearly  optimal  temperature  schedules. 
Recall  that  Mitra  et.  al.  have  shown  that  (2.9)  holds  if 


but  r  L  is  in  general  much  larger  than  A  .  In  2.2,  2.3  we  shall  analyze  the 
rate  of  convergence  in  probability  of  the  annealing  algorithm  for  a  special 
case  with  two  local  minima.  We  will  obtain  results  on  the  convergence  rate 
for  nonparametric  temperature  schedules  (schedules  not  of  the  form 
Tk  =  c/log  k)  and  also  for  temperature  schedules  Tk  =  c/log  k  for  c  >  A  . 
We  remark  that  in  the  latter  case  with  c  =  A  there  is  apparently  some 
interesting  and  unexpected  behavior.  Our  results  are  different  although 
consistent  with  (2.9). 

Also  in  2.2,  2.3  we  shall  explore  the  sample  path  behavior  (as  opposed  to 
the  ensemble  behavior)  of  the  annealing  algorithm.  We  shall  give  a  number 
of  results,  the  most  important  of  which  is  conditions  such  that  the  annealing 
chain  visit  the  set  S  (infinitely  often)  with  probability  one.  Suppose  we  let 


fk»l  if  U(fk  +  1)  <  U(ck) 

£k  else  . 


Note  that  if  {£k}  visits  S  with  probability  one  then  traps  in  S  with 

probability  one,  and  furthermore  no  additional  evaluations  of  U(’)  are 
required  to  compute  {<;k}  over  what  are  required  to  simulate  Hence  by 

just  doubling  the  memory  requirements  and  keeping  track  of  {Cki-  it  seems 
sufficient  to  show  that  {^k}  visit  S  with  probability  one  rather  than  converge 
to  S  in  probability.  Now  it  might  be  imagined  that  the  conditions  on  the 
temperature  schedule  under  which  {£k}  visits  S  with  probability  one  are 


weaker  than  those  under  which  {£k}  converges  to  S  in  probability.  However, 
the  proof  of  Theorem  2.2  shows  that  (assuming  strong  irreducibility  and  weak 
reversibility)  {fk}  visits  S  with  probability  one  iff  (2.12)  holds.  From  this 
point  of  view  our  result  does  not  offer  anything  new;  infact  the  temperature 
schedules  we  consider  are  not  even  optimal.  However,  we  believe  our  result  is 
important  in  the  following  sense.  In  Chapter  3  we  extend  the  annealing 
algorithm  to  general  state  spaces.  It  turns  out  that  our  result  on  the  finite 
state  annealing  chain  visiting  S  infinitely  often  with  probability  one  can  also 
be  extended,  essentially  under  the  condition  that  the  state  space  be  a 
compact  metric  space  and  the  energy  function  be  continuous.  It  is  not  clear 
whether  convergence  to  S  in  probability  can  be  shown  in  such  a  general 
setting;  the  methods  used  to  analyze  the  finite  state  case  (quasiequilibrium 
distributions,  large  deviations  and  perturbation  theory)  do  not  seem  directly 
applicable. 

Finally,  in  2.4  we  give  a  modification  of  the  annealing  algorithm  which 
allows  for  noisy  measurements  of  the  energy  function  and  examine  its 
convergence. 

2.2  Asymptotic  Analysis  of  a  Class  of  Nonstationary  Markov  Chains 

In  this  Section  we  analyze  the  asymptotic  properties  of  a  certain  class  of 
nonstationary  (finite  state)  Markov  chains.  These  chains  will  have  the 
property  that  their  1-step  transition  probabilities  will  satisfy  bounds  similar 
to  those  satisfied  by  the  d-step  transition  probabilities  of  the  annealing  chain. 
The  results  of  this  Section  will  be  used  in  2.3  to  deduce  corresponding 
asymptotic  properties  of  the  annealing  chain. 

We  shall  consider  the  following  class  of  Markov  chains.  Let  S  be  a  finite 
set.  Let  ajj,  £  [0,oo]  for  i,j££,  and  {#k}  a  sequence  of  real  numbers  with 
0  <  <  1.  Let  {£k}  be  a  Markov  chain  with  state  space  E  and  1-step 
transition  matrices  {p(k’k+1)}  =  {[pfJk,lc+1']}  with  the  following  property:  there 
exists  positive  numbers  A,  B  such  that 

p,?^)  >  A  6l"  (2.13) 

pjk.k+o  <  B  (2.14) 

for  all  i,j££.  Actually,  we  shall  assume  that  (2.13)  and/or  (2.14)  hold 
depending  on  the  result  we  wish  to  prove. 


.'V\.VW5Vl’«WJ 


2.2.1  Convergence  in  Probability  and  Rate  of  Convergence  for  a 
Three  State  System 

We  now  establish  the  convergence  in  probability  and  rate  of  convergence 
of  a  Markov  chain  {£k}  with  state  space  £  which  satisfies  (2.13)  and  (2.14)  for 
a  special  case  with  |£|  =  3.  In  2.3.2  we  shall  apply  this  result  to  the 
annealing  chain  with  an  energy  function  which  has  two  local  minima.  It  will 
be  useful  here  to  consider  the  more  detailed  bounds 


Vk "  ^  pf'k+1)  <  VkiJ  v  bj€£  , 

where  Ajj,  Bjj  are  positive  constants.  Here  is  our  theorem. 
Theorem  2.3  Let  £  =  {1,2,3}  and  assume  that  (2.15)  holds.  Let 

a  =  max{a;21,  o;3i}  <  oo  , 
b  =  min{/312,  (3l3 }  >  a  , 

7  —  b  —  a  , 

lmin{A21,  A3I}  if  o21  =  c*31 


(2.15) 


8  —  1  A, 


(a)  Suppose  that  0klO  and 


if  or21  >  or3 
if  CXn |  <  Oo 


Then 


E  *k  =  oo  . 

k— 1 


lim  P{£k  =  1}  =  1  . 

k— -oo 


(2.16) 


(b)  Suppose  (more  strongly)  that  #k|.0  and  there  exists  a  sequence  {ek} 
with  0  <  ek  <  1  and  ek — ►!  such  that 


V  4-  -r  loe  9\,  — ►  oo  as  k — *oc  . 


(2.171 


P{£k  =  l}  =  1  4-  0(0,?)  as  k — >00  . 


The  proof  of  Theorem  2.3  will  require  the  following  lemmas. 

Lemma  2.1  Let  {sk}  be  a  sequence  of  positive  numbers  with  sk — >0  and 

OO 

sk  =  OO  . 

k— 1 

Then 

00  k-l 

e  sk  n  a  -  o  <  00 . 

k-l  n— 1 


Proof  Let 

Sk  =  E  sk  • 

n-1 

Now  since  sk — »0  and  Sk — *oo  we  have 

exp(-  Sk_j)  =  exp(sk)  exp(-  Sk)  <  — 

Sk 

for  some  constant  c.  Hence 


00  k-l 

e  sk  n  (1 


k-l  n— 1 


sn)  <  E  sk  exP(—  Sk-l) 


k-l 


<  C  • 


00 


E 

k-l 


<  OO 


where  the  convergence  of  the  last  series  follows  from  the  Abel-Dini  Theorem 
[20,  p.  290].  □ 

Lemma  2.2  Let  b  >  a  >  0  and  assume  that  0k|.O  and 

OO 

E  #k  =  00  .  (2.19) 

k-l 

Then 


lira  £  #“  n  (1  -  «»*)  =  0  . 

k-*'00  m-1  n-m+1 


Proof  Let 


Pk  =  E  ^  n  (1  -  <£)  • 


m-1  n-m+1 


Let  sk  =  #k.  Then  for  KG3T 


Pk  =  s  •£'*  n  a  - »„) 

m— 1  n— m+1 

<  K  •  sf^  n  (1  -  sB)  4-  6£+1  £  sm  n  (1  -  s n)  v  k  >  K  , 

n— K  m-1  n-m+1 


where  7  =  b  —  a  >  0.  Hence 


k  k 

lim  sup  pk  <  9£+l  sup  £  sm  n  (1  ~  sn) 


k— 00 


m— 1  n— m+1 


(2.20) 


since 


00 


n  n  (1  -  «.*)  =  0 

n— K  n-K 

which  follows  from  (2.19).  Now 

E  3m  FI  (1  sn)  “  E  sm  IT  (l  ~  sn) 

m— 1  n— m+1  m— 1  n— 1 

which  is  established  by  induction  on  k.  Hence  by  Lemma  2.1 

k  k 

S^P  E  sm  n  (1  -  sj  <  00  .  (2.21) 

k  m-1  n-m+1 

Combining  (2.20),  (2.21)  and  letting  K— kx>  (so  that  #j£+1 — K))  gives  pk— >0  as 
required.  □ 

Lemma  2.3  Let  b>a>0,  7  =  b  —  a,  and  assume  that  #k|0  and  there 
exists  a  sequence  {ek}  with  0  <  ek  <  1  and  ek— < >0  such  that 

9. 

(2.22) 


kvk 

sup  — —  <  00 
k 


Then 


Proof  Let 


E  6m  n  (!  -  *m)  =  <W)  as  k—rc 

rn-kv k  n”m+1 


n=  s  4  n  (i-o- 


m-kvk  n*m+1 


Let  sk  =  #k.  Then 


p*  -  b  4A  n  a  -  h) 

m-k-.j  "-“+1 

<  t  £  »m  n  (1  -  s»)  • 

k  m-1  n-m+1 


(2.23) 


k  k  k  m-1 

E  sm  n  (1  -  3n)  =  E  Sm  Ft  I1  ~  Sn) 

m— 1  n— m+1  m-1  n— 1 

which  is  established  by  induction  on  k.  Hence  by  Lemma  2.1  there  exists  a 
constant  cx  such  that 


E  sm  II  C1  ~  sn)  <  Cj  • 

m-1  n-m+1 

Also  from  (2.22)  there  exists  a  constant  c2  such  that 

0,7  <  c2  *  K  . 

kvk  —  i  K 

Combining  (2.23)-(2.25)  gives  pk  =  O(0k)  as  required.  □ 

Proof  of  Theorem  2.3 

(a)  Define  the  events 

n 

Cm>n  =  n  {£k€{2,3}}  V  n  >  m  , 

k-m 

Dm,n  =  {£m  =  X}  H  Cm+l  n  V  n  >  m  . 


(2.24) 


(2.25) 


(2.26) 


(2.27) 


Then 


{^e{2,3}}  =  Cljk  U  u  Dm,k 

m=l 


'  ~r . .9tr 


and 


k~l 


fe,e{2>3}}  -  ck.<tk  U  U  Dm,k 


m— k*<k 


k— 1 


P{ek€{2,3}}  =  PC  +  £  PDmk. 


m-k*fk 


From  (2.29)  we  have 


(2.33) 


pck.<tk<c,  n  (i -us 

n—k*<k 


<  ct  exp 


k-l 


-  S  se, 

n-k*<k 


=  Ci  exp  (<5#k)  exp 


-  <5 


S  0»*  +  T1°sf,k 


n— k*ek 


60 


=  o(90)  as  k— kx>  , 


(2.34) 


where  the  last  equality  follows  from  0£ — >0  and  (2.17).  From  (2.30)  and 
Lemma  2.2 

s'  e'  %  n‘  (!-«».*) 

m-kv  k  m-kvk  n_m+1 

=  O(0k)  as  k— KX)  .  (2.35) 

Combining  (2.33)-(2.35)  gives  P{£k  =  l}  —  1  4-  0(60)  as  required.  □ 

The  following  corollary  considers  a  choice  of  {#k}  which  will  be  seen  to 
correspond  to  a  temperature  schedule  Tk  =  c/log  k  for  the  annealing 
algorithm. 


Corollary  2.1  Let  E,  a,  b,  7,  and  8  be  given  as  in  Theorem  2.3.  Assume 

that 


where  c  is  a  positive  constant. 

(a)  If  c  >  a  then 

lim  P{4  -  1}  =  1  . 

k— »oo 

(b)  If  c  >  a  then 

P{£k  =  1}  =  l  +  O(^)  as  k— +oo  . 

(c)  If  c  =  a  then 

1  +  0(0,2)  if  7  <  8 

P{£k  =  l}  =1  +  O log  k)  if  7  =  8 

1  +  O(0k)  if  7  >  8  ,  as  k— +oo  . 

Proof  We  shall  assume  that  c  =  1;  the  general  case  follows  easily, 
(a)  If  a  <  1  then 


OO  OQ  1 

E  E  77  =  °° 

k-1  k— 1  k 


and  Theorem  2.3(a)  applies. 

(b)  Suppose  a  <  1.  To  apply  Theorem  2.3(b)  we  must  construct  a 
sequence  {ek}  with  0  <  ek  <  1  and  ek— *1  such  that  conditions  (2.17),  (2.18) 
are  satisfied.  Fix  0  <  r]  <  1— a  and  let 

ek  =  1  -  ~  (k  large)  . 


Then  for  sufficiently  large  k 


n— k*«k 


>  /  ^  dx 

k(l— k— ')  x* 


after  evaluating  the  integrel  and  applying  the  Mean  Value  Theorem.  Hence 

V.  &a  +  ~T  1°S  K  >  '7k1-*-''  -  —  log  k  — *•  oo  as  k— *oc  , 

0  0 

n-kvk 

and  consequently  (2.17)  is  satisfied.  (2.18)  is  also  satisfied.  Hence  Theorem 
2.3  (b)  applies. 

(c)  Suppose  a  ==  1.  It  is  not  apparent  in  this  case  how  to  construct  the 
{ek}  sequence  which  is  necessary  to  apply  Theorem  2.3  (b).  However,  we  can 
directly  use  (2.28)-(2.30)  to  get  the  desired  estimate  of  =  l}.  So,  from 
(2.28) 


Now  from  (2.29) 


P{fl6{2,3,}>  -  PC,ik  +  £  PD„,k  . 

m— 1 


pcu  <  e,  n  (i  - 5  w 

o-l 


k-1  i 

<  Cj  exp  —  -5  V  — 
.1".  n 


*  i 

<  Cj  exp  —  (5  f  —  dx 


(2.36) 


(2.37) 


Also,  from  (2.30) 


r  / 


E  PDm,k  <  c2  E  ^m  0  (1-6  01) 

m-1  m-1  n-m+1 

k-1  i  k-1  i 

<'2  E  —  «p  -«  E  - 


a— 1  111 


n— m+1 


k-1  j  K  1 

<  c2  E  ~ i b  exP  “  5  /  ~  dx 

m-1  m  m+1  x 

Co  k-1  i 

-  Ti  S  ~b  •  <m  +  ■) 

k  i  m 
2c2  k-1  i 

<77  E  -gir 

k  m_i  m 

since  (p  +  q)r  <  pr  +  qr  for  p,q  >0,  0  <  r  <  1.  Since  6  <  1  (use  6X 
(2.15))  and  b  >  a  =  1  we  have  b  —  6  >  0.  Hence 

E  PDm,k  <  (i  +  J  ~h dx 

m-1  K  1  x 


1  1 
c3  *  —  -f  c4  •  — 
3  k6  4  V 


if  7^6 


2c2  (1  +  log  k)  *  —  if  7  =  6 

k"* 


(2.38) 


where  c3,  c4  are  suitable  constants.  Combining  (2.36)-(2.38)  completes  the 
proof  of  part  (c).  □ 

2.2.2  Sample  Path  Analysis 

We  now  analyze  the  sample  path  behavior  of  a  Markov  chain  { }  with 
state  space  £  which  satisfies  (2.13)  and/or  (2.14).  We  shall  give  (different) 
conditions  such  that 

•  {£k}  visits  a  subset  of  £  (infinitely  often)  with  probability  one 

•  {^ij  visits  a  subset  of  £  with  probability  less  then  one 

•  1^}  converges  to  (i.e.  eventually  stays  in)  a  subset  of  £  with 

probability  one 


- "  ‘  -•> 


.*■  «*•  >  .N 


It  will  be  convenient  to  use  the  following  notation.  For  J  a  subset  of  £  define 
the  events 


OO 

{£keJ  i.o.}  =  n  u  {£keJ}  , 

n— 1  k>n 


{fkGJ  a. a.}  =  U  D  {£kGJ} 

n-1  k>n 

(i.o.  and  a. a.  stand  for  infinitely  often  and  almost  always,  respectively). 

Our  first  theorem  gives  sufficient  conditions  under  which  {£k}  visits  a 
subset  of  £  infinitely  often  with  probability  one. 

Theorem  2.4  Assume  that  (2.13)  holds.  Let  J  be  a  subset  of  E  and 

a  =  max  min  a-.,  <  co  . 
i€-\J  j€J  J 

Suppose 

E  *  oo . 

k-1 

Then  P{£k&J  i.o.}  =  1. 

Proof  Let  I  =  £\J.  Using  (2.13)  and  the  Markov  property 


(2.39) 

(2.40) 


n 

p  n  {fkei}< 

k-m 


< 


< 


< 


n-l 


P{^ei}  n  maxP{ek+1Gl|^  =  i} 

k-m  >€I 


n  j1  -  min  P{ek+x€Jkk  =  i}  ) 


1  -  mip  E  A  ,J 


iei  ; 


jeJ 


n— l 

n 

k-m 

“ri  (l-A^)  V  n  >  m  . 

k-m 


Hence 

OO 

OO 

P  n  {^€1}  <  II  (1  —  A  #k)  =  0  Vm  , 

k“m  k-m 

where  the  divergence  of  the  infinite  product  follows  from  the  divergence  of  the 
infinite  sum  (2.40),  and  the  Theorem  follows.  □ 


The  next  theorem  gives  sufficient  conditions  that  {£k}  visits  a  subset  of  D 
with  probability  strictly  less  than  one,  at  least  starting  from  certain  initial 
states.  Let  Pj(’)  =  P{'|£i  =  i}  for  all  i£E. 

Theorem  2.5  Assume  that  (2.14)  holds.  Let  J  be  a  subset  of  X!  and 

b  =  max  min  min  3..  >  0  .  (2.41) 

KdJ  ie£\K  jeK  J 

Suppose  that  6 k — >0  and 

V  8y!  <  DC  .  (2.42) 

k-l 

Then  there  exists  an  i£E  such  that 

oc 

p,  u  {eksj}  <  i . 

k-l 

Proof  Let  J  be  a  subset  of  E  containing  J  which  obtains  the  maximum  in 
(2.41)  and  let  I  =  E\J  .  Let  p£l  .  Using  (2.14)  and  the  Markov  property 

“  «  n-i  . 

p,  n  {?kei }  >  n  mi?  P{?k+.ei  lck  =  a 

k“t  k-l  >€l 

=  FI  f1  -  P{4+iGJ’l4  -  >}| 

k-l  l  ) 

n-1  (  j 

>  n  1  -  max  £  B<V' 

k-l  (  ,el  It!' 

>  “n  (1  -B(j‘Kb)  Vn  . 

k-l 

Hence 

OO 

pj  n  {?kei'>  >  n  (i  -  B|j>kb)  >  o 

k"1  k-l 

where  the  convergence  of  the  infinite  product  follows  from  the  convergence  of 
the  infinite  series  (2.42),  and  the  Theorem  follows.  □ 

Finally,  we  give  a  theorem  which  gives  conditions  such  that  |ik' 
converges  to  a  subset  of  E  with  probability  one,  provided  it  visits  that  subset 
infinitely  often  with  probability  one. 
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Theorem  2.8  Assume  (2.14)  holds.  Let  J  be  a  subset  of  L  and 


Suppose  *0  and 


c  =  min  min 

j€J  i€-\ J 


V  #k  <  oc  . 

k-l 


(2.43) 


(2.44) 


Under  these  conditions,  if  P{£kGJ  i.o.}  =  1  then  P{dfkGJ  a. a.}  =  1. 

Proof  Let  I  =  E\J.  Using  (2.14)  and  the  Markov  property 

P{ekGJ,  £k+1€l}  <  P{CkGJ}  max  P{Zk+l&\Zk  -  j} 

j€J 

<  max  V  B  9k“ 

J£J  i€l 

<  b|i  |  et . 

Hence 

E  p(4GJ,  £k+1ei}  <  E  B|I|  et  <  oc 

k-l  k-l 

by  (2.44).  Hence  by  the  “first”  Borel-Cantelli  Lemma  we  must  have 
p{£k&*»  £k+iGI  i.o.}  =  0,  and  it  follows  that  P{£kGJ  a. a}  =  1  whenever 
P{£k€J  i.o.}  =  1.  □ 

2.3  Convergence  of  the  Annealing  Algorithm 

In  this  Section  we  apply  the  results  of  2.2  to  obtain  asymptotic  properties 
of  the  annealing  algorithm.  Throughout  this  Section  (2.3)  we  use  the  notation 
introduced  in  2.1. 

2.3.1  Bounds  on  the  Transition  Probabilities  of  the  Annealing  Chain 

In  order  to  apply  the  results  in  2.2  we  need  to  obtain  bounds  on  the  d- 
step  transition  probabilities  p1^'t,lc+d'  of  the  annealing  chain  {£k}.  Toward  this 
end  we  make  the  following  definitions.  For  every  i,j€^  and  d(E.TIet  Ad(i,j)  be 
the  subset  of  Ed+1  such  that  (i  =  iot...,id  =  j)  £  Ad(i,j)  if 

plk.k+J)  >  Q  V  n  =  o,...,d-l  , 

for  any  kG®  (this  definition  is  valid  since  {Tk}  is  a  positive  sequence  and  so 
p{k,k  +  i)  ^  q  j*or  ajj  ^  wfjenever  p(k  k*i)  o  j*or  some  k).  Hence  \d(i,j)  is  just 


Jr.v.v  a-v  •«>  v >».-  /.-.-.•.-.v.-..  -y-y-s 


i 

& 


*•  .**  **•  . 


*.  .V  *-  \  A 


the  set  of  possible  d-step  transitions  from  state  i  to  state  j  for  the  annealing 
chain.  An  alternate  characterization  of  Ad(i,j)  is  as  follows: 
(i  —  i0,...,id  =  j)  €  Ad(i,j)  iff  for  every  n  =  0,...,d  — 1  either 

CO  <w,  >  0  or 

(ii)  ‘n+i  =  ‘n  and  >  0  f°r  some  with  U(P)  >  U(in). 

This  characterization  follows  easily  from  (2.2). 

For  each  d£3J  let 

Ud(i0,...»id)  =  D  max{0,U(in+i)  -  U(iJ}  , 


for  all  i0,...,id  €  and 


Vd(i,j)  -  inf  Ud(\)  , 

X£Ad(!,j} 

V(ij)  =  infVd(i,j) 

d 


(2.45) 


(2.46) 


for  all  i,jE£.  Note  that  the  infinum  in  (2.46)  is  obtained  for  d  <  (X!(.  Also 
note  that 

v(i,j)  <  V(i,p)  +  V(0j)  v  •  (2.47) 

We  shall  call  Vd(i,j)  the  d-step  transition  energy  from  i  to  j,  and  V(i,j)  the 
transition  energy  from  i  to  j. 

The  following  theorem  gives  upper  and  lower  bounds  on  the  d-step 
transition  probabilities  of  the  annealing  chain  in  terms  of  the  d-step 
transition  energy. 

Theorem  2.7  Let  {T^}  be  monotone  nonincreasing  and  d£N  Then  there 
exists  positive  numbers  A,  B  such  that 


A  exp  — 


VdOJ) 

Tk+d-i 


<  p*k-k+d)  <  B  exp  - 


Vd(i,j) 


V  ij&l  •  (2.48) 


Proof  We  prove  the  lower  bound  in  (2.48);  the  upper  bound  is  similar.  Let 


rk('’j)  =  (k.k  +  t) 


if  j  /  i 
if  j  =  i 


(2.49) 


for  all  i,j€.-,  and 
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d-i 

^k(*o>-”>‘d)  =  [1  rkOn’*n+l)  > 
n-0 


(2.50) 


r('»o.....»d)  =  inf  ?k(*o»---»*d)  . 

k 


(2.51) 


for  all  i0,...,idG£.  If  Xg£<1+1  then  since  {T^}  is  nonincreasing  {r^X)}  is 
nondecreasing  and  so  r(X)  =  r^X)  obtains  the  infimum.  Note  that  r(X)  >  0 
for  all  X£Ad(i,j),  i,jGE 

Now  from  (2.2)  and  (2.49)-(2.51)  we  have  that 

Ud(X) 


pOc.k+d)  >  E  f(X)  exp 
X6Aj(id) 


k+d-1 


V  i,j€ 


fry; 


For  each  iJG£  if  Vd(i,j)  <  oc  let 

N(ij)  -  {X£Aa(ij)  :  U„(X)  -  Vd(iJ)}  #  0 

and  set 


(2.52) 


-  E  ?(x)  >  0  ; 

X€N(ij) 


if  vd(i  j)  =  oo  set  atj  =  1.  Then  from  (2.52) 

V4(U) 


Pijk’k*d)  >  a  exp 


•k+d-1 


V  i,jGE 


where  A  =  min  a;;  >  0  .  □ 

■  J€E  J 


Remark  We  note  that  the  proof  of  Theorem  2.7  is  quite  trivial,  and  we 
would  like  to  point  out  that  our  reason  for  presenting  it  in  detail  is  for 
comparison  with  the  (more  difficult)  proof  of  the  genera!  state  analog 
(Theorem  3.3)  to  come. 


2.3.2  Convergence  in  Probability  and  Rate  of  Convergence  for  Two 
Local  Minima 

We  now  apply  the  results  of  2.2.1  to  establish  the  convergence  in 
probability  and  rate  of  convergence  for  an  annealing  chain  {^k}  with  an 
energy  function  U(-)  with  two  local  minima.  We  shall  consider  the  following 
example  in  detail: 


n 


.  .  .*■ 

V'-y-v.'. 


m 

-•  -•  -v'v' 


(H)  £  =  {1,2,3} 

U(l)  <  U{3)  <  U(2) 

<ll2>  Q2H  <l23>  032  >  0 
qjj  =  0  otherwise  . 

The  annealing  chain  corresponding  to  (H)  is  illustrated  by  the  transition 
diagram  in  Figure  2.1.  Let 

a  =  U(2)  -  U(3)  , 
b  =  U(2)  -  U(l)  , 

1  =  U(3)  -  U(l)  , 

^  =  cl32*<l21  • 

Here  is  our  theorem. 

Theorem  2.8  Assume  the  conditions  in  (H). 

(a)  Suppose  Tkj.O  and 

£  exp  (-  -£-]  =  00  .  (2.53) 


Then 


lim  P{ek  -  1}  -  1 . 

k— »oo 


(b)  Suppose  (more  strongly)  that  Tk|0  and  there  exists  a  sequence  {ek} 

with  0  <  ek  <  1  and  >1  such  that 

' 

^  £  "v  1 

£  exP  “  ~  T  *  ^ - -  00  as  k— *-oo  ,  (2.54) 

Mir  0  1  9lr 


sup  — 
k  Tzk 


<  00 


(2.55) 


Then 


p{£k  =  1}  =  1  +  O  exp  -  — - 


as  k — *  x  . 


(2.56) 


Proof  Let 


Vv  —  ^2k  ) 


£k  ^2k+l  i 


9 k  =  exp 


' 

1 

f 

1 

T2k 

j  ^  exp 

T2k+1 

Then  {r7k},  {fk}  are  Markov  chains  with  1-step  transition  matrices 
|R(k,k+i)|  _  |jr.(k,k+i)j|  ^  |g(k,k+i)j  _  ^[s.(k.k+i)j ^  respectively,  which  satisfy 

<  rik'k+l)  <  bXV>  , 

Vks  ^  s?’k+1)  <  B.Jrk,) . 

for  appropriate  a,:,  /3l},  A^j,  Bjj,  and  it  is  clear  that  these  constants  may  be 
chosen  such  that 

a  =  max{a:21,  #31}  <  00  , 
b  =  min {/?12,  /?13}  >  a  , 

1  -  b  -  a  , 

8  —  A31  . 

Henci  e  are  (almost)  in  a  position  to  apply  Theorem  2.3  to  {r?k}  and  {ftj. 

Suppose  that  TkJ.O  and  (2.53)  holds.  Since  {Tk}  is  nonincreasing,  the 
divergence  of  the  series  in  (2.53)  implies  that  • 

OO  OO 

E  ^k  =  00 ,  E  Tk  =  00  • 

k-1  k-1 

Hence  we  may  apply  Theorem  2.3  (a)  to  {?7k},  {&}  to  get 

lira  P{£2k  =  1}  =  lim  P {r?v  =  l}  =  1  , 


lim  P{£2k+1  =  1}  =  lim  P{fk  =  1}  =  1  , 


and  hence 


lim  P{£k  =  1}  =  1 

k— ►oc 

which  proves  (a). 

Suppose  that  Tk|0  and  (2.54),  (2.55)  hold.  Now  (2.54),  (2.55)  are 
equivalent  to,  respectively, 


LVA_*ti 


%*M  >.»_■  jMjMilUl  I, 


i'Li'Li'I 


E  0n  +  y  log  0k  — ►  oo  as  k— *oc  , 


n— k*ek 


kvk 

sup  — -  <  oo  . 

k  Pk 


(2.57) 


(2.58) 


Hence  we  may  apply  Theorem  2.3  (b)  to  { r? k)  to  get 


p{C2k  =  1}  =  P{??k  =  1}  =  1  +  0  |exp  j-  y— 
We  make  the  following 


as  k— hoc  .  (2.59) 


Claim 


K  /y 

E  rn  +  log  rk  — ►  oo  as  k — *oo  , 


n— k‘ck 


kvk 

sup - <  oo  . 

k  rk 


(2.60) 


(2.61) 


Suppose  the  Claim  is  true.  Then  we  may  apply  Theorem  2.3  (b)  to  {fk} 
to  get 

\  \ 

p{^2k+i  =  !}  =  P{fk  =  1}  =  1  +  O  exp  — - —  as  k— oo  ,  (2.62) 

i2k+l 

/  / 

and  it  would  follow  from  (2.59)  and  (2.62)  that 


P(£k  =  1}  =  1  +  O  exp  — 


as  k — kx)  , 


which  would  prove  (b).  It  remains  to  prove  the  Claim. 


Proof  of  Claim  We  first  show  that 


SkP  T 


2k+l  P2k 


<  OO 


(2.63) 


-L-c-J - ! - + - 1 — 

P2k  P2(k+1)  "^2(k+l)"f|;  ,^2(k+l)vk 


In  view  of  (2.55)  it  is  enough  to  show 


■’SOS?-® 


rTt'iViW 


« /  ^ 


or  since  (TjJ  is  nonincreasing, 

2(k+l),ek  <  2k  (k  large)  . 

Suppose  this  last  inequality  is  not  satisfied.  Then  there  exists  a  sequence  {kn} 
of  positive  integers  with  kntoo  and 


Hence 


knek0  >  kn  —  ekn  >  kn  —  1  . 


lim  inf  £  +  -jr  log  #k 

k— *oo  0 

n-k*ek 


<  lim  +  7  log  eK 

n— *oo  0 


=  —  oo 

which  contradicts  (2.57).  Hence  (2.63)  must  be  true.  Now  using  (2.63)  we 
obtain 

k  k 

S*P  E  °n  ~  E  Tn  <  sUp  -  ^k+l]  <  OO  , 
n— k#€k  n=k#6ic 


sup  (log  ek  -  log  rk)  =  sup  — - —  -  <  oo  , 

k  k  1 2k+i  2k 


T  6 

kVk  kvk  1 

sup  — —  -  — —  <  sup  exp  ~ — 


k  7V 


<  OO  , 


■  2k+l  1  2k 


and  (2.60),  (2.61)  now  follow  from  (2.57),  (2.58).  This  completes  the  proof  of 
the  Claim  and  hence  the  Theorem.  □ 


WWmtfWIWfffWCTillWI 


Corollary  2.2  Assume  the  conditions  in  (H).  Let 

Tk  =  l^T  (k  lar6e) 


where  c  is  a  positive  constant, 

(a)  If  c  >  a  then 


lim  P{£k  =  1}  =  1  . 

k-*co 


(b)  If  c  >  a  then 


p{£k  =  1}  =  1  +  O  exp  - 


as  k — ►oo  . 


(2.64) 


(c)  If  c  =  a  then 


1  +  0  exp  - 

J-k 


P{£k  =  1}  =  1  +  O  exp  —  ~L-  +  log  log  k 

xk 


if  'yCS 


if  =  7 


1  +  0  exp  ( — 


where  ?  =  6/2. 


if  "y  >  7)  ,  as  k- 


(2.65) 


Proof  We  may  use  Corollary  2.1  by  appropriately  identifying  variables.  Let 

Vlt  ~  ^2k  )  fk  =  ?2k+l  » 


1 

k1^  * 


Then  {^k},  {fk}  are  Markov  chains  with  one  step  transition  matrices 
{R(k.k+i)j  =  {[ri(k-k+1)]},  {S(fc'k+I)}  =  {(SiCk'k+1)]},  respectively,  which  satisfy 

Ajj  9k“"  <  rfA+'l,  sfrk+H  <  Blj  «k"  (k  large) 
for  appropriate  ay,  /?y,  Ajj,  By,  and  these  constants  may  be  chosen  such  that 


b  =  min{/?12,/?13}  >  a  , 

7  =  b  —  a  , 

^  —  -A-3J  . 

Hence  we  may  apply  Corollary  2.1  (a)-(c)  to  {r?k},  {<Tk}  to  get  the 
corresponding  (a)-(c)  here.  □ 

Remarks  on  Theorem  2.8  and  Corollary  2.2 

(1)  Theorem  2.8  (a)  is  a  simple  case  of  Theorem  2.2  (Hajek’s  Theorem) 
since  a  =  A(2)  =  A  ,  the  optimal  constant  (see  (2.10)). 

(2)  We  compare  our  results  with  the  rate  of  convergence  (2.9)  given  by 
Mitra  et.  al.  First,  Theorem  2.8  (b)  gives  the  rate  of  convergence  of 
P{£k  *=  l}  to  1  for  nonparametric  temperature  schedules,  in  particular 
schedules  not  of  the  form  T^  —  c/log  k.  This  is  possible  essentially  due  to  the 
application  of  the  Abel-Dini  Theorem  on  infinite  series  in  the  proof  of  Lemma 
2.1.  (2.9)  is  valid  only  for  temperature  schedules  of  the  form  T^  =  c/log  k. 
Second,  Corollary  2.2  (b),  (c)  gives  the  rate  of  convergence  for  temperature 
schedules  of  the  form  Tk  =  c/log  k  for  c  >  a,  whereas  (2.9)  only  holds  for 
c  >  r  L  —  2*[U(2)  —  U(l)j  >  U(2)  —  U(3)  =  a.  Furthermore,  for  c  >  r  L  where 
(2.9)  does  hold,  (2.64)  is  in  general  tighter: 

=  _ l _ 

Recall  that  Mitra  et.  al.  suggest  choosing  c  >  r  L  such  that  min  {a,0}  is 
maximized  (see  (2.9)).  Our  results  suggest  choosing 

a  if  7  <  7 

a  4-  £  if  *7  >  7 

where  0  <  e  <  a  [(7 f5)  —  1]  (see  (2.64)  and  (2.65)).  We  want  to  stress  that 
(2.9)  holds  for  general  U(’)  whereas  we  have  not  been  able  to  extend  Theorem 
2.8  and  Corollary  2.2  to  a  U(’)  with  more  than  two  local  minma. 

(3)  The  proof  of  Theorem  2.8  and  Corollary  2.2  (which  rely  on  Theorem 
2.3  and  Corollary  2.1)  show  that  there  are  two  factors  which  limit  the  rate  at 
which  P{£k  =  1}  converges  to  1.  One  factor  corresponds  to  the  rate  at  which 
the  annealing  chain  makes  transitions  from  state  1  to  state  3  and  back.  For 
temperature  schedules  of  the  form  Tk  =  c/Iog  k  this  factor  dominates  for 
c  >  a  and  has  a  characteristic  time  scale  1/7.  Note  that  7  =  U(3)  -U(l) 


depends  only  on  the  energy  function  U(‘).  The  other  factor  corresponds  to  the 
rate  at  which  the  annealing  chain  makes  it  first  transition  from  state  3  to 
state  1.  For  temperature  schedules  of  the  form  Tk  =  c/iog  k  this  factor  is 
only  important  for  c  =  a  and  has  characteristic  time  scale  \[5.  Note  that 
~5  —  <l32<l2i/2  does  not  depend  on  the  energy  function  U(*).  We  wonder 
whether  there  is  some  physical  significance  to  all  of  this. 

2.3.3  Sample  Path  Analysis 

We  now  apply  the  results  of  2.2.2  to  analyze  the  sample  path  behavior  of 
the  annealing  chain  {£)t}.  To  avoid  trivialities  we  will  need  the  following 
assumptions: 

(PI)  Every  iE£\S  can  reach  some  jES 

(P2)  There  exists  an  i££\S  such  that  for  every  jES,  i  can  only  reach  j 
at  height  greater  than  U(i). 

The  following  theorem  gives  conditions  under  which  the  annealing  chain 
{£k}  visits  S  infinitely  often  with  probability  one.  Let 

V*  =  max  min  V(i,j)  (2.66) 

i€E\S  jeS  v 

Note  that  (Pi)  holds  iff  V*  <  oo. 

Theorem  2.9  Assume  (Pi).  Let  {Tk}  be  monotone  nonincreasing  and 


£  exp  -  —  =  oo  • 

k-i  Ak 


(2.67) 


Then  P{£kES  i.o.}  =  1. 


Proof  We  first  show  there  exists  a  dENsuch  that 

V  =  max  min  Vrf(i,j)  . 
i€L'\S  j€S 

For  every  iEX)\S  there  exists  a  djENsuch  that 

min  Vd(i,j)  =  min  V(i,j)  <  V*  . 

J6S  j^S 

Let  d  =  max  d;.  Now  it  is  easy  to  see  that  for  every  iES 

ieE\s 

min  V  (i,j)  <  min  V_(i,j)  V  n  >  m  . 
j€S  j€S 

Hence  for  every  iE^\S 


(2.68) 


>  vs  ;.  -. '  lo 


fttfc  ««.  *»■  .1. 


min  Vd*{i,j)  =  min  min  Vn(i,j)  <  V 

j€S  n<  d‘  j€S 

and  (2.68)  follows  by  setting  d  =  d*. 

Next,  from  Theorem  2.7  there  exists  a  positive  number  A  such  that 

rvk+<n  ^  .  [  Vd(i,j)  1 


Pif'k+d)  >  A  exp  - 


-k+d-1 


V  i,j€S 


£  k  =  fkd>  #k  =  exp  - 


■kd+d-1 


<*(i,j)  =  Vd(i,j)  Vi,jGD 


(2.69) 


Then  {£k}  is  a  Markov  chain  with  1-step  transition  matrices 
|p(k,k+x)|  =  ||-.(k,k+i)j|  which  satigfy 

pf -k+1)  >  A  V  i,j€S  . 


a  =  max  mm  O'::  . 
i€E\S  jes  J 

By  (2.68)  and  (2.69)  a  =  V  .  Hence  since  {Tk}  is  nonincreasing  the  divergence 
of  the  series  in  (2.67)  implies 

OO 

£  d£  =  oc. 

k— 1 

Hence  we  may  apply  Theorem  2.4  to  {£k}  with  J  =  S  to  get  P{£k£S  i.o.}  =  1 
and  so  P{£kES  i.o.}  =  1.  □ 

Remark  Clearly  V  >  A  ,  the  optimal  constant  (see  (2.10),  (2.66)).  Hence 
(assuming  strong  irreducibility  and  weak  reversibility)  Theorem  2.2  is  a  much 
stronger  result.  However,  the  importance  of  Theorem  2.9  is  that  it  can  be 
extended  to  a  general  state  version  of  the  annealing  algorithm  under 
essentially  the  condition  that  the  state  space  be  a  compact  metric  space  and 
the  energy  function  be  continuous.  This  will  be  done  in  Chapter  3. 

The  next  theorem  gives  conditions  under  which  the  annealing  chain  {d;k} 
visits  S  with  probability  strictly  less  than  one.  Let 


44 


V,  =  max  min  min  V(i,j)  . 
KOS  i€E\K  j£S 


(2.70) 


Note  that  (P2)  and  (2.47)  imply  Vj  >  0. 
Theorem  2.10  Assume  (P2).  Let  Tk— +0  and 


£  e*P  -  —  <  oo 
k-l  Ak 


Then  there  exists  an  i€S  such  that 


Pi  U  {£k€S}  <  1  • 

k— 1 


Proof  From  Theorem  2.7  there  exists  a  positive  number  B  such  that 


>i(jk-k+1)  <  B  exp  VijSS. 

*  k 


Theorem  2.5  may  be  applied  to  {fk}  in  an  obvious  manner.  □ 


Finally,  we  give  a  theorem  which  gives  conditions  such  that  the 
annealing  chain  {£k}  converges  to  S  with  probability  one,  provided  it  visits  S 
infinitely  often  with  probability  one.  Let 


V,  =  min  min  V(j,i)  . 
j€S  i€£\S 


(2.71) 


Theorem  2.11  Let  Tk— *0  and 


£  exp  - 


<  oo  . 


If  P{^k€S  i.o.}  =  1  then  P{£ic€S  a. a.}  =  1. 


Proof  From  Theorem  2.7  there  exits  a  positive  number  B  such  that 


jk,k+i)  <  B  exp  _  v  i)jGv  . 

Tk 


Theorem  2.6  may  be  applied  to  {£k}  in  an  obvious  manner.  □ 


Remark  Theorem  2.2  or  2.9  may  be  combined  with  Theorem  2.11  to  give 
conditions  under  which  the  annealing  chain  {^k}  converges  to  S  with 
probability  one.  Note,  however,  that  is  is  not  always  possible  to  do  this  since 
it  is  not  in  general  true  that  V2  >  V*  or  even  V2  >  A  (see  (2.10),  (2.66), 
(2.71)). 


■’.-•A-.  A  A  /  /,  /.A-*.  A'.i.A  A  -  . 


2.4  Annealing  Algorithm  with  Noisy  Energy  Measurements 

In  this  Section  we  consider  a  modification  of  the  annealing  algorithm  so 
as  to  allow  for  noisy  measurements  of  the  energy  differences  which  are  used  in 
selecting  successive  states.  This  is  important  when  the  energy  differences 
cannot  be  computed  exactly  or  when  it  is  simply  too  costly  to  do  so.  Using 
the  notation  introduced  in  2.1  we  construct  the  modified  annealing  chain  as 
follows.  At  time  k,  given  the  current  state  is  i  we  select  a  candidate  state  j 
with  probability  q^.  We  assume  that  the  energy  difference  U(j)  —  U(i)  is 
measured  with  (additive)  noise,  which  is  independent  of  states  and  candidate 
states  at  times  less  than  or  equal  to  k,  and  noise  at  times  less  than  k.  The 
exponent  of  the  energy  difference  plus  noise  is  then  used  to  determine  whether 
a  transition  is  made  from  i  to  j.  More  precisely,  let  {wk}  be  a  sequence  of  E- 
valued  independent  random  variables.  Construct  a  E-valued  discrete-time 

process  {£k}  with  £k+1  conditionally  independent  of  and  w1,...,wlc_1 

given  £k  and  wk,  and 


P{£k+i  =jl£k  wk  -  w} 


qjj  exp 

U(j)-U(i)+w' 

Tk 

if  U(j)  —  U(i)  +  w  >  0,  j  ^  i  , 
if  U(j)  —  U(i)  -F  w  <  0,  j  ^  i, 


for  all  i,j€S.  It  is  easy  to  see  that  {£k}  is  a  Markov  chain.  Let 
|p(k,k+i)j  _  {[pj(k<lc+1)]}  be  the  1-step  transition  matrices  for  {£k}.  Then  since 
wk  is  independent  of  £k  we  have 


P i?,k+1)  =  E{P{fk+1  =  jtek>  wk}lfk  =  i} 

-E{p{ek+i-jkk-i.  wk}} 


/  qjj  exp 
{ w>U(i) — U(j)} 


U(j)  -  U(i)  +  s 

Tk 


dFk(w) 


+  %  P{wk  <  U(i)  -  U(j)}  V  j  *  i  , 


(2.72) 


where  Fk(*)  is  the  distribution  function  for  wk.  We  shall  call  {£k}  the 
annealing  chain  modified  for  noisy  energy  measurements.  In  the  sequel  we 


shall  only  consider  the  case  where  wk  is  Gaussian  with  mean  0  and  variance 
<7k  >  0.  Hence  (2.72)  can  be  written  as 


P,r+1)  =  /  q,j  exp  -  ^ ^  —  W  dN(0,<7k2)  (w) 

U(i)  -  V(j)  ik 

+  %  N(0,<rk2)  (  oc,  U(i)  -  U(j)]  V  j  *  i  .  (2.73) 

The  following  theorem  shown  that  if  the  noise  variance  goes  to  zero  fast 
enough  then  the  1-step  transition  probabilities  for  the  annealing  chain 
modified  for  (Gaussian  additive)  noisy  measurements  are  asymptotically 
equivalent  to  the  1-step  transition  probabilities  for  the  unmodified  annealing 
chain. 

Theorem  2.12  If 

<rk  =  o(Tk)  as  k— *oc 

then 

q,j  exp 
% 

as  k— *oo,  for  all  i,j(EE. 

Proof  Fix  i,j€£  with  j  ^  i  and  >  0.  Let 

OO 

ak  =  /  qjj  exp 

U(i)  -  U(j) 

bk  =  qij  N(0,crk)  (-oc,  U(i)  -  U(j)] 

so  that  (2.73)  becomes 

P?'k+1)  =  ak  +  bk  .  (2.75 

lim  ak  =  0  if  U(j)  <  U(i)  , 


Clearly, 


(2.76 


lim  bv 

k— 'oc 


ciij  if  U(j)  <  U(i) 

^  if  U(j)  -  U(i) 


(2.77) 


We  make  the  following 
Claim 


ak  -  qjj  exp 


bk  —  o  exp  — 


-  U  l 


-  U  1 


if  U(j)  >  U(i)  and  crk2  -  o(Tk)  (2.78) 


if  U(j)  =  U(i)  and  o\  =  o(Tk2)  (2.79) 


if  U(j)  >  U(i)  and  <rk  =  o(Tk)  (2.80) 


as  k— *oc. 


Suppose  the  Claim  is  true.  Then  combining  (2.75)-(2.80)  gives  (2.74)  if 
<7k  —  o(Tk),  as  required.  It  remains  to  prove  the  Claim. 


Proof  of  Claim  We  have 


*k  "  I  <lij  «*P  “ 
U(i)  -  U(j) 


—  um 


dN(0,<rk)  (w) 


qij  exp 


k  I  TkfU(i)  -  U(j)l 


/  e_w  dN(0,crk2/Tk2)  (w)  (2.81) 


after  a  change  of  variable.  Choose  W  <  0  and  let 

e~w  if  w  >  W 

f(w)  -  |e_w  if  w  <  w  . 

Then  for  sufficiently  large  k 

/  e-  dN(0,<rk/Tk)  (w) 

Tk[U(i)  -  U(j)l 

oo  Tk(U(i)  -  U(j)j 

-  /  f(w)  dN(0,<7k/Tk)  (w)  -  /  f(w)  dN(0,crk/Tk)  (w)  (2.82) 

— oo  -oo 

We  analyze  these  last  two  integrels  as  follows.  First,  if  =  o(Tk)  then 


1  ’  •/  *.*  V*  -  *  V  %  \  .  SA/.  -.V.  .%■  %  '  S  %  A  ••  *•  *.  *.  A  V  \ 


N(0,<7k/Tk)  (*)  converges  weakly  to  the  unit  measure  concentrated  at  the 
origin,  and  since  f(*)  is  a  bounded  and  continuous  function  on  ^ 


lim  f  f(w)  dN(0,<7k/Tk)  (w)  =  f(0)  =  1  . 

k-*°°  -oo 

Next,  if  U(j)  >  U(i)  and  crk  =  o(Tk)  then 
T k[U(i)  -  U(j)) 

/  f(w)  dN(0,<7k2/Tk2)  (w) 

—  OO 

<  e-w  N(0,<7k!/Tk2)  {-oc,  Tk[U(i)  -  U(j)(] 
-  e~w  N(0,1)  ((Tk!/<rk)  •  (U(j)  -  U(i)|,  oo) 


<  e  exp  — 


2(<rk!/Tk') 


0  as  k  — ►  oo  , 


(2.83) 


(2.84) 


where  we  have  used  N(0,l)  (x,oo)  <  exp(— x2/2)  for  x  >  0.  Combining  (2.81)- 
(2.84)  gives  (2.78).  (2.79)  may  be  proved  similarly  by  taking  W  =  0  above. 
As  for  (2.80),  if  U(j)  >  U(i)  and  <7k2  =  o(Tk)  then 

bk  =  N(0,crk)  (-oo,  U(i)  -  U(j)] 

=  N(0,1)  ((l/ak)  •  [U(j)  -  U(i)],oo) 


<  exp  — 


=  o  |exp  - Tk  jj  as  k—oc  . 

again  using  N(0,l)  (x,oc)  <  exp(—  xz/2)  for  x  >  0.  This  completes  the  proof 
of  the  Claim  and  hence  the  Theorem.  □ 

The  asymptotic  behavior  of  the  annealing  chain  modified  for  (Gaussian 
additive)  noisy  measurements  follows  immediately: 


Corollary  2.3  If 


4  -  o(Tk<) 


as  k — ►oo 


then  Theorems  2.1,  2.2,  2.7-2.11  hold  with  {£k}  by  {£k}. 

Remarks 

(1)  The  Corollary  is  more  or  less  obvious,  since  the  convergence  in  (2.74) 
is  uniform  for  i,jEE  (since  E  is  finite);  we  leave  the  details  to  the  reader. 

(2)  We  have  reason  to  believe  that  cr^  =  o(T^)  is  quite  conservative  and 
that  cr^  =  o(T^)  may  suffice. 


.•  .• 


CHAPTER  m 

GENERAL  STATE  ANNEALING  TYPE  ALGORITHMS 


3.1  Introduction  to  the  General  State  Annealing  Algorithm 

In  Chapter  2  we  discussed  the  annealing  algorithm  as  introduced  by 
Kirkpatrick  [19]  and  Cerny  [3]  for  combinatorial  optimization.  In  this  Section 
we  extend  the  annealing  algorithm  for  optimization  on  general  spaces.  The 
general  state  annealing  algorithm  will  consist  of  simulating  a  nonstationary 
Markov  chain  whose  state  space  is  the  domain  of  the  cost  function  (called 
energy)  to  be  minimized.  This  Markov  chain  will  be  a  general  state  space 
analog  of  the  finite  state  annealing  chain  described  in  Chapter  2.  As  far  as 
we  know,  no  one  has  given  a  careful  formulation  of  such  an  algorithm  and 
proved  a  convergence  result.  Indeed,  there  even  seems  to  be  some  question 
regarding  conditions  under  which  the  Metropolis  algorithm,  i.e.,  the  annealing 
algorithm  at  a  fixed  temperature,  may  be  used  for  sampling  from  a  continuous 
Gibbs  distribution  (c.f.  [16]).  Geman  and  independently  Grenander  [13]  have 
suggested  using  diffusions  for  optimization  on  multi-dimensional  Euclidean 
space.  This  approach  and  its  relationship  to  the  general  state  annealing 
algorithm  is  described  in  Chapter  4. 

We  first  give  some  standard  general  state  space  Markov  chain  notation 
(c.f.  [6],  [27]).  Let  E  be  an  arbitrary  set  and  let  B  be  cr-field  of  subsets  of  E. 
P(v)  is  a  stochastic  transition  function  on  (E,J3)  if 

•  for  every  AGP  P(*,A)  is  R-measurable 

•  for  every  xEE  P(x,‘)  is  a  probability  measure  on  (£,5) . 

{Pk(*,*)}  are  the  1-step  transition  functions  for  a  Markov  chain  {£k}  with 
state  space  £  if  for  every  kE®  Pk(v)  is  a  stochastic  transition  function  on 
(£,.£?)  and 

Ptflc+i€A|4}  -  Pk(4-A)  w.p.  1  (3.1) 

for  all  AES.  Conversely  given  a  sequence  {Pk(v)}  of  stochastic  transition 
functions  on  (£,13)  we  can  construct  on  a  suitable  probability  space  (fl,F,P)  a 
Markov  chain  {£k}  with  state  space  £  which  satisfies  (3.1).  For  each  dE^let 
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p(k'k+d)(x,A)  =  /  Pk(x,dXl)  -  /  Pk+d-2(x<l-2.dxd_1)  P k+d-i(xd-i>A) 

for  all  x£E  and  A£B.  p(k-k+d)(*>«)  is  a  stochastic  transition  function  on  (E,.B) 
and 

P{fk+d6Al<;k}  -  P<k'k+d>(4,A)  w.p.  1 

for  all  A £B.  It  will  be  convenient  to  have  a  fixed  version  of  the  conditional 
probability  of  fk+d  given  fk  which  we  define  by 

P{4+d€A|4  -  *}  =  P(k’k+d)M 

for  all  x£E  and  A£B. 

It  is  characteristic  of  the  theory  of  Markov  chains  with  general  state 
space  that  there  exists  an  auxiliary  cr-finite  measure  usually  denoted  by  $(•), 
i.e.,  the  state  space  is  a  a-finite  measure  space  (T,,B,<fi).  We  shall  adopt  this 
framework.  We  now  define  the  general  state  annealing  algorithm.  Let  U(*) 
be  a  nonnegative  .B-measurable  function  on  E,  which  we  shall  call  the  energy 
function.  The  goal  is  to  find  a  point  in  E  which  minimizes  or  nearly 
minimizes  U(-).  Let  {Tk}  be  a  sequence  of  positive  numbers,  which  we  shall 
call  the  temperature  schedule.  Let  q(v)  be  a  nonnegative  J5x5-measurable 
function  on  ExE  such  that 

/  q(x,y)  </<dy)  =  1  v  *eE  • 

Now  let  {£k}  be  the  Markov  chain  with  state  space  E  and  1-step  transition 
functions  {Pk(v)}  given  by 

Pk(x,A)  =  /  q(x,y)  sk{x,y)  <£(dy)  +  7k(x)  <5(x,A)  (3.2) 

A 

for  all  x£E  and  A £B,  where 


Sk(x.y) 


U(y)  -  U(x) 

exp 

1 

Tk 

if  U(y)  >  U(x) 
if  U(y)  <  U(x)  , 


Tk(x)  =  1  -  /  q(x,y)  Sk(x,y)  <^(dy)  , 


and  5(x,*)  is  the  unit  measure  concentrated  at  x,  for  all  x,y£E  (note  that 


Fubini’s  Theorem  guarantees  that  Pk(v)  defined  by  (3.2)  is  a  valid  stochastic 
transition  function).  We  shall  denote  by  pk(x,*)  the  density  of  the  <£- 
absolutely  continuous  component  of  Pk(x,’)  and  by  p'1  k^(x,‘)  the  density  of 
the  ^-absolutely  continuous  component  of  P^’*c‘td'(x,').  Note  that  if  E  is  finite 
and  4>[')  is  counting  measure  then  {£k}  is  just  the  finite  state  annealing  chain 
of  Chapter  2  with  q^-  =  q(i,j)  (see  (2.1)).  Hence  we  shall  also  call  {fk}  the 
annealing  chain,  and  the  algorithm  which  simulates  the  sample  paths  of  (£k) 
with  Tk — K)  the  annealing  algorithm. 

The  motivation  behind  the  general  state  annealing  algorithm  is  similar  to 
the  finite  state  case  as  described  in  Chapter  2.  Let 

Q(x,A)  =  /  q(x,y)  o(dy) 


for  all  x£S  and  A&B.  Q(',‘)  is  a  stochastic  transition  function  on  (E,B).  For 
each  dElFlet 

Q(d)(x,A)  =  /  QMxj)  —  /  Q(xd_2,dxd_1)  Q(xd_ltA) 

for  all  xEE  and  A E-B.  The  following  definitions  generalize  the  familiar  finite 
state  definitions.  We  shall  say  that  Q(v)  is  irreducible  if  for  every  xEE  and 
A EB  with  </>(A)  >  0  there  exists  dEIF  such  that  Q^(x,A)  >  0.  We  shall  say 
that  Q(v)  is  symmetric  if  q(x,y)  =»  q(y,x)  for  all  x,y£E.  Suppose  Q(v)  is 
irreducible  and  symmetric,  and  let  be  the  stationary  chain  with  1-step 

(stationary)  transition  function  PT(v)  given  by  the  r.h.s.  of  (3.2)  with 
Tk  =  T,  a  positive  constant.  Suppose  that  0  <  <£(E)  <  oc.  Then  it  can  be 
shown  that  PT(v)  has  aD  invariant  Gibbs  measure  IIT('),  i.e., 

nT(A)  =  /  nT(dx)  pt(x,a)  v  a eB  , 

where 

/  exp[—  U(x)/T]0(dx) 


/  exp(—  U(y)/T)<^(dy) 

This  follows  from  the  detailed  reversibility 

7rT(x)pT(x,y)  =  ^T(y)pT(y,x)  , 

valid  for  <^>xd>-a.e.  x,  y£E,  where  ttT(')  and  pT(x,’)  are  the  densities  of  the  <N 
absolutely  continuous  components  of  FIT(')  and  PT(x/),  respectively. 
Furthermore,  Q(y)  irreducible  and  symmetric  implies  that  {£i^}  is  an 
irreduciblef  (and  aperiodic)  chain  and  if  a  certain  condition  of  Doeblin  [6,  p. 

jA  stationary  chain  is  irreducible  if  its  1-step  (stationary)  transition  function  is 

Irreducible. 


/  /. 


.*4  .*>  . 
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192]  is  satisfied,  then  by  a  version  of  the  Markov  Convergence  Theorem  [6,  p. 
199] 

lim  P{&TGA}  =  nT(A)  V  AGB  .  (3.3) 

k— *oo 

Let  S  be  the  set  of  global  minima  of  U(*),  i.e., 

S  =  {x€E:  U(x)  <  U(y)  Vy€E} 

(assume  S  ^  0  for  the  moment).  Now  for  small  T  we  expect  nT(*)  to  be 
concentrated  near  S.  Like  the  finite  state  case,  the  idea  behind  the  general 
state  annealing  algorithm  is  that  by  choosing  T  =  Tk— *0  slowly  enough  the 
probability  measure  of  actually  becomes  concentrated  near  S. 

Unlike  the  finite  state  case  there  are  some  technical  problems  in  just 
verifying  (3.3).  We  need  to  check  Doeblin’s  condition  and  we  also  need  a 
practical  criterion  to  check  whether  Q(v)  is  irreducible.  These  issues  are 
investigated  in  3.2.  We  will  not  use  (3.3)  in  our  analysis  of  the  annealing 
algorithm  with  time-dependent  temperature  schedule.  However  (3.3)  is  of 
independent  interest  as  it  constitutes  the  theoretical  justification  of  a 
continuous  state  version  of  the  Metropolis  algorithm  which  may  be  used  for 
sampling  from  a  continuous  Gibbs  distribution  (c.f.  [16]). 

In  3.3,  3.4  we  shall  extend  our  result  (Theorem  2.9)  on  the  finite  state 
annealing  chain  visiting  S  with  probability  one  to  the  general  state  case, 
under  essentially  the  condition  that  the  state  space  E  be  a  compact  metric 
space  and  the  energy  function  U(*)  be  continuous. 

3.2.  Ergodicity  of  the  General  State  Annealing  Chain  at  a  Fixed 
Temperature 

In  this  Section  we  shall  discuss  the  ergodicity  of  the  general  state 
annealing  chain  at  a  fixed  temperature.  We  shall  use  the  notation  of  3.1 
except  that  we  will  fix  a  temperature  schedule  Tk  =  T,  a  positive  constant, 
and  suppress  the  dependence  of  the  various  quantities  on  T  and  also  on  the 
time  index  k  whenever  possible.  In  this  notation  we  shall  give  conditions 
under  which 

lim  P{£kEA}  =  n(A)  V  A £B  .  (3.4) 

k— too 

We  have  already  remarked  in  3.1  that  (3.4)  will  hold  if  Q(v)  is  irreducible 
and  symmetric  and  a  certain  condition  due  to  Doeblin  is  satisfied.  Doeblin’s 
condition  will  be  satisfied  if 
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(D)  0  <  </>(£)  <  oo  and  there  exists  e  >  0  such  that  P(x,A)  <  1  —  e 

for  all  x££  and  A EB  with  <f>( A)  <  e. 

Under  suitable  conditions  on  £,  <£(•),  U(*),  and  q(v)  we  shall  verify  (D)  and 
give  a  convenient  characterization  of  the  irreducibility  of  Q(v)-  These  same 
conditions  will  be  used  in  3.4  to  analyze  the  general  state  annealing  chain 
with  time-dependent  temperature  schedule.  We  shall  also  give  an  example  of 
a  class  of  q(v)  which  satisfy  the  stated  conditions. 

Consider  the  following  set  of  conditions: 

(Al)  (£,p)  is  a  compact  metric  space 

(A2)  (Z,B,4>)  is  a  nontrivial  finite  measure  space  with  B  the  Borel 
subsets  of  £ 

(A3)  </>(•)  is  positive  on  open  subsets  of  £ 

(A4)  U(’)  is  continuous 
(A5)  q(v)  is  bounded 

(A6)  q(v)  ls  continuous  on  {(x,y)££x£  :  q(x,y)  >  0} 

(A7)  </>({*})  is  lower  semicontinuous  on  (x££  :  q(x,x)  >  0} 

We  remark  that  not  all  of  these  conditions  will  be  used  to  obtain  every 
result. 

The  following  proposition  deals  with  Condition  (D). 

Proposition  3.1  Assume  that  (Al),  (A2),  (A4),  (A5)  hold.  Then  there  exists 
e  >  0  such  that  P(x,A)  <  1  —  e  for  all  x££  and  A£B  with  </>(A)  <  e. 

Proof  Using  (3.2)  and  (A5)  there  exists  a  constant  cx  such  that 

P(x,A)  <  CjflA)  +  7(x)  V  x££  ,  V  AGB  .  (3.4) 

Now 


=  1  -  J  q(x»y)  s(x»y)  <^(dy) 


<  1  -  /  q(x,y)  exp 


tel  -  U(x)l 

T 


0(dy) 


<  1  -  c 2  /  q(x,y)  <ji>(dy) 


-  1  C2 


V  x££  , 


(3.5) 


for  some  constant  c2  >  0,  since  (Al)  and  (A4)  imply  that  U(*)  is  bounded. 
The  Proposition  now  follows  from  (3.4)  and  (3.5).  □ 

We  next  develop  a  criterion  for  the  irreducibility  of  Q(v)  motivated  by 
the  finite  state  case.  We  shall  say  that  given  states  x  and  y,  x  can  reach  y  if 
there  exists  a  sequence  of  states  x  =  x$,  ...,  xp  =  y  such  that  q(xn,  xn+1)  >  0 
for  all  n  =  0,...,p— 1.  Suppose  that  E  is  finite,  0(* )  is  counting  measure,  and 
qjj  =  q(i,j).  Then  this  definition  reduces  to  that  given  in  Chapter  2.  Now  the 
stochastic  transition  matrix  Q  =  [q^j  is  irreducible  iff  i  can  reach  j  for  all 
iJ€E.  The  following  Theorem  gives  a  similar  criterion  for  the  stochastic 
transition  function  Q(x,A)  =  J  q(x,y)  0(dy). 

A 

Theorem  3.1  Assume  that  (Al)-(A3),  (A6)  hold.  Then  Q(v)  is  irreducible 
iff  x  can  reach  y  for  all  x£E  and  0- a.e.  y£E. 

Proof  Suppose  that  Q(v)  is  irreducible  and  there  exists  x£E  and  A £B  with 
0(A)  >  0  such  that  x  cannot  reach  y  for  all  y£A.  Then  there  exists  a  d£N 
such  that  Q^(x,A)  >  0,  and  by  Fubini’s  Theorem 

Q(d)(x,A)  =  /  Q(x,dx!)  —  /  Q(xd_2,dxd_1)  Q(xd_1,A) 

=  /  q(x.*i)  fldxi)  —  /  q(xd_2,xd_1)  0(dx f  q(xd_I,xd)  0(dxd) 

A 

=  /  q(x,xx)  •  —  •  q(xd_i,xd)  0d(dxfdxd) 

>  0  .  (3.6) 

Hence,  q(x,x1),...,q(xd_1,xd)  >  0  for  some  x1,...,xd_1£E  and  xd6A,  and  so  x  can 
reach  some  y€A,  a  contradiction. 

Conversely,  suppose  that  x  can  reach  y  for  all  x£E  and  <fi- a.e.  y£E.  We 
first  show  that  given  e  >  0  there  exists  a  compact  CCE  with  0(C)  >  0(E)  —  e 
such  that  x  can  reach  y  for  all  xGE  and  y£C.  Let  B£B  such  that 

0(B)  =  0(E)  and  x  can  reach  y  for  all  x£E  and  y£B.  Recall  that  a  Borel 
measure  is  regular  if  given  5  >  0  and  a  Borel  set  F  there  exists  a  compact 

set  K  and  an  open  set  G  such  that  KCFCG  and  n(G)  —  /i(K)  <  6.  It  is 

known  that  finite  Borel  measures  on  compact  metric  spaces  are  regular  (c.f. 
[28,  Ch.  14]  for  a  discussion  of  these  matters).  Hence  by  (Al),  (A2)  0(‘)  is  a 
regular  Borel  measure  and  so  there  exists  a  compact  CCB  such  that 

0(C)  >  0(B)  —  €  >  0(E)  —  e  and  necessarily  x  can  reach  y  for  all  x£E  and 
y€C. 


We  next  show  that  there  exists  a  djEN  such  that  x  can  reach  y  in  not 
greater  than  steps  for  all  x£E  and  y£C.  By  (A6)  if  x  can  reach  y  in  d(x,y) 
steps  then  there  exists  neighborhoods  Ux  of  x  and  Vy  of  y  such  that  u  can 
reach  v  in  d(x,y)  steps  for  all  u£Ux  and  vEVy.  Now  {Uxx(VynC)  :  xEE,  yEC} 
is  an  open  cover  of  compact  ExC  (in  the  relative  topology)  and  so  there  exists 
Xi,...,xN  €  £  and  y1(...,yN  G  C  such  that 

N 

ExC  C  U  Ux  xVy  . 

n— 1  D 

Let 

dt  =  max  d(xn,yn)  . 

Now  fix  xEE  and  A £B  such  that  0(A)  >  0.  Ultimately  we  want  to  show 
that  there  exists  d£3I  such  that  Q^(x,A)  >  0.  If  0(A)  =  0(E)  then 
Q^(x,A)  =  1  for  all  d£EK  So  assume  that  0  <  0(A)  <  0(E).  The  next  step  is 
to  show  that  there  exists  d2£lEand  DEB  with  DGA  and  0(D)  >  0  such  that  x 
can  reach  y  in  d2  steps  for  all  y£D.  Choose  0  <  e  <  0(E)  —  0(A)  in  the 
definition  of  C  above.  Then  0(Cfl A)  >  0(E)  —  0(A)  —  £  >  0  and  x  can  reach 
y  in  not  greater  than  dx  steps  for  all  yECPlA.  For  n  =  let 

Cn  =  {yECPlA  :  x  can  reach  y  in  n-steps} 

Then  CnEB  for  n  =  1,...^  and 

d. 

U  Cn  =  CHA  . 

n— 1 

Hence  since  0(CRA)  >  0  we  may  choose  d2E{l,...,d1}  such  that  0( Cd2)  >  0. 
Let  D  =  Cdj. 

Let  d  =  d2.  By  one  additional  application  of  Fubini’s  Theorem  to  (3.6) 

Q(d)(x,A)  >  Q(d)(x,D)  =  /  f(y)  0(dy)  (3.7) 

D 

where 

f(y)  =  /  q(x»xi) . q(xd-i>y)  0d_1(  ’xi  ***  dxd-i)  • 

Since  f(’)  is  a  immeasurable  function  on  E  and  0(D)  >  0,  if  we  can  show  that 
f(’)  is  positive  on  D  then  by  (3.7)  Q(d)(x,A)  >  0  and  we  are  done.  We  now 
show  that  f(*)  is  indeed  positive  on  D.  Fix  yED  and  let 
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q(xi,...,Xd-i)  =  qfc.xj . q(xd_1(y)  V  x1,...,xd_1GS  . 

Then 

f(y)  =  /  q(*i»...,xd_i)  <^<1-1(dx1  —  dxd_!)  . 

Since  x  can  reach  y  in  d  steps  there  exists  Xj,...,xd_iG£  such  that 
q(x1,...,xd_1)  >  0.  Using  (A6)  there  exists  neighborhoods  Bn  of  xn, 
n  =  l,„.,d— 1,  such  that  q(*)  is  positive  on  BiX  •••  xBd_!.  Since  q(*)  is  a  Bd~l- 
measurable  function  on  Ed+1  and  (^^(BjX  •••  xBd_j) 

=  ^(Bj . </>(B d_i)  >  0  by  (A3),  we  have  that  f(y)  >  0,  and  since  yGD  was 

chosen  arbitrarily,  we  have  f(*)  is  positive  on  D  as  required.  □ 

We  end  this  Section  by  giving  an  example  of  a  class  of  q(y)  which  have 
the  property  that  the  corresponding  annealing  chain  makes  “small”  moves  in 
a  topological  sense.  This  is  consistent  with  the  approach  taken  in  the  finite 
state  case  as  discussed  in  Chapter  1.  Of  course  if  £  is  a  metric  space  than 
the  notion  of  smallness  is  well-defined.  We  construct  a  function  q(v)  as 
follows.  Assume  that  (A1)-(A3)  hold,  and  let  p(v)  and  R(*)  be  positive 
continuous  functions  on  £x£  and  E,  respectively.  Let 

q(x,y)  =  c(x)  p(x,y)  XB(x  R(x))(y)  V  x,yG£  ,  (3.8) 

where 

c(x)  =  /  p(x,y)  <£(dy)  V  xGE  . 

B(*,R(x)) 

Note  that  if 

/  p(x,y)  <f>{ dy)  =  1  V  xGE  , 

and  £  is  a  random  variable  which  density  p(x,*)  with  respect  to  <{>(•),  then 
q(x,‘)  is  a  density  for  the  conditional  distribution  of  £  given  £GB(x,R(x)).  The 
following  proposition  establishes  that  q(v)  satisfies  (A5),  (A6). 

Proposition  3.2  Suppose  that 

<?K{y€E  :  ^(x,y)  =  R(x)})  =0  V  xGE  .  (3.9) 

Then  q(v)  is  bounded  and  continuous  on  {(x,y)GEx£  :  q(x,y)  >  0}. 


\  \  *. 

/  .*  r  .* 


.v  /.  /.  % 


Proof  Let 


f(x,y)  =  p(x,y)  XB(XiR(x))(y)  V  x,y£E  , 


so  that 

q(x,y)  =  c(x)  f(x,y)  v  x,ye£  , 

and 

c(x) =  (/  f(x>y)  ^(dy))~  v  x££  • 

Using  the  continuity  of  p(y)  R(‘)>  and  p(v)  we  have  that  f(y)  is  a  continuous 
function  on  {(x,y)£ExE  :  p(x, y)  ^  R(x)}  and  hence  on 

{(x,y)£ExE  :  q(x,y)  >  0}.  We  now  show  that  c(*)  is  a  continuous  function  on 
E.  Let  xGE  and  {xn}  be  a  sequence  in  E  such  that  xn — ►x.  Then 
f(xn,y)— *f(x,y)  for  all  y€E  such  that  p(x, y)  ^  R(x).  Hence  by  (3.9) 
f(xn,y) — *f(x,y)  for  4>- a.e.  y€E,  and  by  the  Dominated  Convergence  Theorem 
c(xn) — *c(x).  Since  x  and  {xn}  were  arbitrary,  c(*)  is  continuous.  The 
Proposition  follows.  □ 

Remark  If  E  is  a  subset  of  KT,  <£(•)  is  Lebesgue  measure  and 
P(x,y)  —  |y  —  x|  then  (3.9)  is  of  course  satisfied. 


з. 3  Asymptotic  Analysis  of  a  Class  of  Nonstationary  Markov  Chains 

In  this  Section  we  analyze  certain  asymptotic  properties  of  a  class  of 
nonstationary  Markov  chains.  These  chains  have  the  property  that  their  1- 
step  transition  probabilities  satisfy  bounds  similar  to  those  satisfied  by  the  d- 
step  transition  probabilities  of  the  annealing  chain.  The  results  of  this 
Section  will  be  used  in  3.4  to  deduce  corresponding  asymptotic  properties  of 
the  annealing  chain. 

We  shall  consider  the  following  class  of  Markov  chains.  Let  (E,p)  be  a 
compact  metric  space  and  a  finite  measure  space  with  B  the  Borel 

subsets  of  E  and  4>{’)  positive  on  the  open  subsets  of  E.  Let  »(y)  be  a  [0,oo]- 
valued  upper  semicontinuous  function  on  ExE  and  {^}  a  sequence  of  real 
numbers  with  0  <  <  1.  Let  {£iJ  be  a  Markov  chain  with  state  space  E 

and  1-step  transition  functions  {P]c(*,*)}  whose  (^absolutely  continuous 
components  have  densities  {pt(v)}  "with  the  following  property:  for  every 

и. vGE  there  exists  a  neighborhood  Bu  v  of  (u,v)  in  ExE  and  a  positive  number 
K(u,v)  such  that 


■N? 


(3.10) 


Pk(x>y)  >  K(u,v)  v  (x,y)GBuv  . 

Note  that  we  do  not  assume  there  exist  a  positive  number  A  such  that 

Pk(x.y)  >  A  0£(x'y)  V  x,yGE  ,  (3.11) 

which  is  similar  to  (2.13).  Of  course  if  S  is  finite  and  4>{')  is  counting 
measure,  then  we  do  obtain  (3.11). 

The  following  theorem  gives  sufficient  conditions  under  which  {4}  visits 
an  open  subset  of  E  infinitely  often  with  probability  one. 

Theorem  3.2  Let  Y  be  an  open  subset  of  E  and 

a  =  sup  inf  ot(x, y)  <  oo  . 
xg£\Y  ygY 

Suppose  there  exists  e  >  0  such  that 

OO 

2  0k+<  =  oo  .  (3.12) 

k— 1 

Then  P{4GY  i.o.}  =  1. 

Remark  If  E  is  finite  and  <fi(‘)  is  counting  measure  we  obtain  Theorem  2.4 
modulo  the  factor  of  e  in  (3.12)  as  compared  with  (2.40).  However  Theorem 
3.2  cannot  be  proved  by  the  simple  argument  used  to  prove  Theorem  2.4, 
essentially  because  we  assume  only  (3.10)  and  not  (3.11). 

We  will  need  the  following  two  lemmas  for  the  proof  of  Theorem  3.2. 

Lemma  3.1  Let  c  >  0.  Then  there  exists  a  nonnegative  lower 

semicontinuous  function  L(v)  on  ExE  with  L(x,y)  >  0  whenever  a(x,y)  <  c 
and 

Pk(x»y)  >  L(x»y)  v  x,yeE . 

Proof  Let  U  =  {(u,v)GExE  :  c*(u,v)  <  c}  which  is  an  open  subset  of  ExE 
since  c*(v)  *1S  upper  semicontinuous  on  ExE.  Now  by  (3.10) 

Pk(x»y)  >  K(u,v)  ,  V(x,y)GBu  v,  V(u,v)GU  . 

Let 

Ki(x*y)  =  .  sup  K(u,v)  V  (x,y)GU  . 

(u,v)€U  : 

BUly3(x,y) 


It  follows  that 


and 


Pk(x.y)  >  K1(x,y)  et  v  (x,y)EU  , 


(3.13) 


inf  K^u.v)  >  K(x,y)  >0  V  (x,y)6U  .  (3.14) 

(u.vjeBxj, 

Let  K2(v)  be  the  lower  envelope  of  K|(v)»  i.e., 


K2(x,y)  =  sup  inf 

6>0  0<ft(ulx)+fi(v,y)<S 


Ki(u,v) 


V  x,y€U  . 


Then  (c.f.  [l])  K2(v)  is  a  lower  semicontinuous  function  on  U  and 

K2(x,y)  <  Kt(x,y)  for  all  (x,y)£U.  Also  (3.14)  implies  that  K2(y)  is  positive. 
Let 


L(x,y)  = 


K2(x,y) 

0 


if  (x,y)€U 
if  (x,y)£U  , 


for  all  x.yGE.  Since  U  is  open  and  K2(v)  is  a  positive  lower  semicontinuous 
function  on  U,  L(v)  is  a  lower  semicontinuous  function  on  ExE  which  is 
positive  on  U.  Furthermore 


Kx(x,y)  >  K2(x,y)  =  L(x,y)  V  (x,y)EU  .  (3.15) 

The  Lemma  now  follows  from  (3.13)  and  (3.15).  □ 


Lemma  3.2  Let  Y  be  an  open  subset  of  E  and 

a  =  sup  inf  cdx.y)  . 

x€E\Y  yeY 

Let  e  >  0  and  L(v)  be  a  lower  semicontinuous  function  on  ExE  such  that 
L(x,y)  >  0  whenever  o;(x,y)  <  a  +  e.  Then  there  exists  open  sets 
contained  in  Y  such  that 


sup  min  sup 
x€£\Y  y€Wm 


ct(x,y)  <  a  +  e  , 


inf  max  inf  L(x,y)  >  0  . 
x€£\Y  y6Wm 


Proof  Let  X  =  E\Y.  We  first  show  there  exists  a  relatively  open  cover 
Ui,...»Un  of  X  and  open  sets  VN  contained  in  Y  such  that  c*(x,y)  <  a  +  e 
for  all  x£Un,  y€Vn,  and  n  =  1,...,N.  For  every  X0C  there  exists  a  y£Y  such 
that  Qr(x,y)  <  a  +  e,  and  since  Q:(•,•)  is  upper  semicontinuous  there  exists 
neighborhoods  Ax  of  x  and  Bx  of  y  such  that  a(u,v)  <  a  4-  e  for  all  uEAx  and 
vEBx.  Since  {A^flX  :  xEX}  is  an  open  cover  of  compact  X  (in  the  relative 
topology),  there  exists  x^^^Xj^EX  such  that 


s> 
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X  C  U  A,  , 

D  —  1 

Let  Un  =  A^HX  and  Vn  =  B^flY  for  n  = 

We  next  show  there  exists  a  8  >  0  such  that  for  every  x£X  there  exists  a 
y€Y  and  an  nG{l,...,N}  such  that  x£Un,  yGVn,  and  L(x,y)  >  6.  Let 

fn(x)  =  sup  L(x,y)  v  xGX  ,  V  n  = 
yev„ 


f(x)  =  max  fn(x)  V  xGX  . 

n-l,...,N  : 

U„3x 

Since  L(v)  is  lower  semicontinuous,  are  lower  semicontinuous 

functions  on  X,  and  since  {Uj,...,UN}  are  open  in  X,  f(')  is  a  lower 
semicontinuous  function  on  X.  Now  L(x,y)  >  0  whenever  c*(x,y)  <  a  +  e  and 
in  particular  when  xGUn  and  ycVn  for  some  n  =  It  follows  that  f(-)  is 

positive.  Hence  since  f(-)  is  a  positive  lower  semicontinuous  function  on 
compact  X  we  can  choose 

0  <  6  <  inf  f(x)  . 
t€X 

Combining  the  above  results,  for  every  xGX  there  exists  a  yGY  such  that 
a(x,y)  <  a  +  e  and  L(x,y)  >  8.  We  can  now  find  similarly  to  the  construction 
of  Uj,...,U]v  and  Vj,...,VN  above  a  cover  Ui,...,UM  of  X  and  open 
contained  in  Y  such  that  ct(x,y)  <  a  +  e  and  L(x,y)  >  8  for  all  x£EUm,  y£Vm, 
and  m  =  Let  Wm  =  Vm  for  m  =  to  complete  the  proof  of  the 

Lemma.  □ 

Proof  of  Theorem  3.2  Let  X  =  £\Y.  From  Lemmas  3.1  and  3.2  there 
exists  a  8  >  0  and  open  sets  Wj,...,WM  contained  in  Y  such  that  for  every 
xGX  there  exists  an  mG{l....,M}  such  that 

Pk(x.y)  >  8  V  yGWm.  (3.16) 

Using  (3.16)  and  the  Markov  property 


(3.16) 


.  - .  c>.v.v’^.v  ••.v.v  v  v,v> -•  v  .  . 


•?w 


p  n  teex}  <  p{{Bex}  n  P{fk+i£xl4  =  *} 

k-«  k-m 

<  *n  [l  -  inf  PlCk^iCTlCk  »x}| 

<  n  1  -  /  Pk(x,y)  <£(dy) 

k-m  x€X  Y 

<  n  [l  —  A  0k +<  j  Vn  >  m  , 

k-m 

where  A  =  5  •  min  $(Wm)  >  0  (since  <£(•)  is  assumed  positive  on  open 
subsets  of  £).  Hence 

0°  j-.. 

P  n  {£kex}  <  n  1  -  A  6Z+l  =0  V  m  , 

k“m  k-m  L  J 

where  the  divergence  of  the  infinite  product  follows  from  the  divergence  of  the 
infinite  sum  (3.12),  and  the  Theorem  follows.  □ 

3.4  Convergence  of  the  General  State  Annealing  Algorithm 

In  the  Section  we  apply  the  results  of  3.3  to  obtain  certain  asymptotic 
properties  of  the  general  state  annealing  algorithm.  Throughout  this  Section 
(3.4)  we  shall  use  the  notation  introduced  in  3.1.  We  shall  also  refer  to 
conditions  (Al)-(A7)  given  in  3.2. 

3.4.1  Bounds  on  Transition  Probabilities  for  the  General  State 
Annealing  Chain 

In  order  to  apply  the  results  of  3.3  we  need  to  obtain  a  bound  on  the  d- 
step  transition  density  p(k'k+d)(y)  of  the  (^-absolutely  continuous  component 
of  the  d-step  transition  function  p(k-k+d)(v)  of  the  annealing  chain  {£k}. 
Toward  this  end  we  make  the  following  definitions.  For  every  x,y££  and  d£3J 
let  Ad(x,y)  be  the  subset  of  £d+1  such  that  (x  =  x0,„.  ,xd  =  y)GAd(x,y)  if  for 
every  n  =  0,...,d— 1  one  of  the  following  is  true: 

(i)  xn+i  *  xn  and  q(xn,xn+1)  >  0 

(ii)  xa+1  =  xn  and  q(xn,xn+1)  >  0,  (£({xn})  >  0 
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(iii)  xn+1  =  xn  and  q(xn,z)  >  0  for  some  zG^  with  U(z)  >  U(xDj. 

The  following  proposition  gives  an  alternative  characterization  of  A<j(v)- 

Proposition  3.3  Assume  (Al)-(A6).  Let  x,yGE  and  dGN  Then 
(x  =  Xq,...^  =  y)€Ad(x,y)  iff  there  exists  a  version  of  pk(v)  such  that 

max  {pk(xn,xn+1),  Pk(xn,{xn+1})}  >0  V  n  =  0,...,d-l  . 

Proof  By  the  Radon-Nikodym  Theorem  and  (3.2) 

Pk(x,A)  =  /  pk(x,y)  <£(dy)  +  Pk(x,A) 

A 

-  /  q(x,y)  sk(x,y)  <£(dy)  +  ^(x)  <5(x,A) 

A 

for  all  xEE  and  AGP,  where  <£{*)  and  P^x/)  are  mutually  singular.  Hence 

Pk(x»y)  =  q(x>y)  sk(x>y)  +  *{*}(y) 

for  all  xG£  and  0-a.e.  y£E,  and 

Pk(xf{y})  =  q(x.y)  sk(x,y)  <£({y})  +  7k(x)  x{x,(y)  (3.17) 

for  all  x,y£E.  Fix  the  following  version  of  pk(v): 


q(x>y)  sk(x,y) 


Pk(x.y)  =  1  q(x>x)  + 


^(W) 


if  y^x, 

if  y  =  x,  4>{{x})  >  0 
if  y  =  x  ,  0({x})  =  0 


(3.18) 


for  all  x,yG£.  Now  under  (A1)-(A6)  for  every  xG£  7k(x)  >  0  iff  q(x,z)  >  0  for 
some  zGS  with  U(z)  >  U(x).  It  follows  from  (3.17),  (3.18)  and  this  last  remark 
that  (i)-(iii)  hold  iff  pk(xn,xn+1)  >  0  or  Pk(xn,{xn+1})  >  0  for  all  n  =  0,...,d— 1. 
□ 

Suppose  E  is  finite,  <p{’)  is  counting  measure  and  q^  =  q(i,j).  In  view  of 
Proposition  3.3  the  above  definition  of  Ad(v)  reduces  to  that  given  in  Chapter 
2. 

For  each  dGNlet 


V  *-  *•  ^  .N  . 

•■'o  -  "  -w'  -s.-  V 


OH 


Ud(x0,...,xd)  =  £  max{0,  U(xB+1)  -  U(xn)}  , 

n-0 

for  all  x0,...,xdGS,  and 


Vd(x,y) 


.  Ud(X)  if  y  ^  x 
^€Ad(*,y) 

oc  if  y  =  x  , 


V(x,y)  =  inf  Vd(x,y)  , 
d 


(3.19) 


for  all  x,yE£.  We  shall  call  Vd(x,y)  the  d -step  transition  energy  from  x  to  y, 
and  V(x,y)  the  transition  energy  from  x  to  y.  We  should  like  to  point  out  a 
difference  in  the  definition  of  Vd(v)  here  and  in  Chapter  2  (compare  (3.19) 
and  (2.45)).  Here  we  set  Vd(x,x)  =  oc;  see  the  remark  following  Proposition 
3.4  for  an  explanation. 

We  first  prove  that  the  d-step  transition  energy  (and  hence  the  transition 
energy  itself)  is  an  upper  semicontinuous  function. 

Proposition  3.4  Assume  (Al)-(A6).  Then  Vd(v)  is  an  upper 

semicontinuous  function  from  SxS  into  [0,ooJ. 

Proof  Let  x,yE£  such  that  Vd(x,y)  <  oo,  and  let  e  >  0.  From  (3.19)  we 
have  that  y/x  and  there  exists  X0\d(x,y)  such  that  Ud(\)  <  Vd(x,y)  +  e/2. 
It  is  clear  that  X  can  be  chosen  such  that  all  of  the  self-transitions  in  X  occur 
consecutively.  We  consider  here  the  following  case  (the  other  cases  are 
similar): 

X  =  (x  =  x0  =  •••  =  xm_i  #  xm  ^  •••  ^  xd  =  y) 

where  1  <  m  <  d  .  Now  q(x,x)  >0  or  q(x,z)  >  0  for  some  zG£  with 
U(z)  >  U(x),  and  q(xn,xn+1)  >  0  for  all  n  =  m— l,...,d— 1.  Hence  by  (A4),  (A6) 
we  can  choose  neighborhoods  Bx  of  x  and  By  of  y  with  BxnBy  =  0  such  that 
for  every  uEBx  and  v£By  we  have  q(u,u)  >  0  or  q(u,z)  >  0,  q(u,xm)  >  0, 
q(xd-i,v)  >  0,  and 

(U(u)  -  U(x)|  +  fU(v)  -  U(y)|  <  j  . 


Now  let  (u,v)EBxxBy  and 


m  times 


Then  crEAd(u,v)  and 

|U.M  -  Ud(X)|  <  |U(u)  -  U(x)|  +  [U(v)  _  U(y)|  <  j 

and  so 

Vd(u,v)  <  Ud(a)  <  Ud(X)  +  ±<  Vd(x,y)  +  £ 

and  consequently  Vd(v)  is  upper  semicontinuous  at  (x,y).  Since  x,y  were 
arbitrary  points  in  E  which  satisfy  Vd(x,y)  <  oo,  Vd(v)  is  upper 
semicontinuous.  □ 

Remark  Let 

Vd(x,y)  =  inf  Ud(\)  V  x.yGE 
*€Ad(*,y) 

so  that  Vd(x,y)  =  Vd(x,y)  for  y  ^  x  but  Vd(x,x)  ^  Vd(x,x)  in  general.  It  is  easy 
to  construct  examples  such  that  Vd(x,y)  is  not  upper  semicontinuous  at  y  =  x. 
We  defined  Vd(x,x)  =  oo  to  avoid  this  problem. 


The  following  theorem  gives  a  lower  bound  on  the  d-step  transition 
probabilities  of  the  annealing  chain  in  terms  of  the  d-step  transition  energy. 


Theorem  3.3  Assume  (Al)-(A7).  Let  {TjJ  be  monotone  nonincreasing  and 
dE®  Then  there  exists  a  version  of  p(k'k+d)(*,*)  with  the  following  property: 
given  e  >  0  for  every  u,vEE  there  exists  a  neighborhood  Bu  v  of  (u,v)  in  ExE 
and  a  positive  number  K(u,v)  such  that 


p(k-k+d)(x,y)  >  K(u,v)  exp 


Vd(u.v)  +  c  ' 

Tk+d-l 


V  (x,y)€Bu  v  . 


(3.20) 


Remark  We  do  not  assert  (nor  do  we  believe  it  is  true  in  general)  that 
there  exists  a  positive  number  A  such  that 


p(k,lc+d)(x^y)  >  £  exp 


Vd(x,y)  +  e 


•  k+d-1 


Vx,y6E  , 


(3.21) 


which  is  similar  to  the  lower  bound  in  (2.48)  for  the  finite  state  case.  Of 
course  if  E  is  finite  and  4>(')  is  counting  measure  than  we  do  obtain  (3.21). 


We  will  need  the  following  lemma  for  the  proof  of  Theorem  3.3. 

Lemma  3.3  Assume  (Al)-(A7).  Then  Pk(‘,{'})  is  a  continuous  function  on 

E. 

Proof  From  (3.2) 

Pk(x,{x})  =  q(x,x)  ${x})  +  7k(x) 

=  1  -  /  q(x,y)  ak(x,y)  <£(dy) 

=  q(x,x)  <£({x})  +  /  q(x,y)  [l  -  sk(x,y)]  <£(dy)  ,  (3.22) 


for  all  xGE.  Let  x€£  and  {xn}  be  a  sequence  in  £  with  xn— *x.  Now  from  the 
second  equality  in  (3.22) 


lim  sup  Pk(xn,{xn})  <  1  -  lim  /  q(xn,y)  sk(xn,y)  <jS»(dy) 

n— *-oc  n—+oc  , 

q(x,y)>0 

=  i  -  /  q(x>y)  Sk(x»y)  <Kdy) 

y«, 

q(x,y)>0 

=  Pk(x,{x})  ,  (3.23) 


where  we  have  used  (Al)-(A6)  and  the  Dominated  Convergence  Theorem  to 
evaluate  the  limit.  Also,  from  the  third  equality  in  (3.22) 


lim  inf  Pk(xn,{xn})  >  lim  inf  q(xn,xn)  <£({xn}) 

n-+oc  n— +00 

+  lim  /  q(xn,y)  [1  -  sk(xn,y)l  <t>{dy) 

n~*°°  q(x,y)>0 

>  q(x,x)  <£({x})  +  lim  /  q(xn,y)  [l  -  sk(xn,y)]  </>(dy) 

n->0°  q(x,y)>0 

=  q(x,x)  ${x})  +  /  q(x,y)  [1  -  sk(x,y)]  <£(dy) 

q(x,y)>0 


=  Pk(x,{x})  , 


(3.24) 


where  we  have  used  (A6),  (A7)  to  obtain  the  second  inequality  and  (Al)-(A6) 
and  the  Dominated  Convergence  Theorem  to  evaluate  the  limit.  Combining 
(3.23)  and  (3.24)  gives 


lim  Pkfo.fo})  =  Pk(x,{x})  , 

n— *-oo 

and  since  x,  {xn}  were  arbitrary  the  Lemma  follows.  □ 

Proof  of  Theorem  3.3 


Let 


ik(*,y)  = 


q(x,y) 

Pk(x,{x}) 


if  y  ¥=  X  , 
if  y  —  x  , 


for  all  x,yG£,  and 


d-l 


^k(x<)>...iX(i)  rk+n(Xn>Xn+i)  , 

n— 0 


(3.25) 


(3.26) 


r(x0,.„,xd)  =  inf  rk(x0,...,xd)  ,  (3.27) 

k 

for  all  Xo,...,xdG£.  If  \GSd+1  then  since  {Tk}  is  nonincreasing  (rk(X)}  is 
nondecreasing  and  so  r(X)  =  rx(X)  obtains  the  infimum.  Note  that  r(X)  >  0 
for  all  X£Ad(x,y),  x,yG£. 

For  every  xG£  define  a  measure  t/{x,*)  on  (£,£)  by 
i/<x,A)  =  <£( A)  +  [1  -  <£({x})]  <5(x, A) 

and  define  a  measure  ^x,*)  on  (Ed+1,Sd+1)  by 

V'(x,A0x**'xAd)  =  f  5{x, dx0)  /  t^xo.dxj)  •••  f  rKxd_u dxd)  . 

Ao  Aj  A<j 

It  follows  from  (3.2)  and  (3.25)-(3.27)  that 

p(k.k+d)(x>A)  >  /  r(X)  exp  -  5<x,dX)  (3.28) 

>>eAd(x.y) ,  Tk+d_1 

j£  A 

for  all  xG£  and  A £B. 

For  every  n  =  l,...,d  and  x,yG£  let 
Mn(x,y)  =  {(x0l...,xd)GAd(x,y)  :  xn_i  #  y  ,  xk  =  y  Vk  =  n,...,d} 

Then  from  (3.28) 

P<kJ‘+d>(x,A)  >  £  ;  r(X)  exp  -  ^ x ,dX)  .  (3.29) 

n-1  X€M„(x,y),  ^k+d-1 

y€A 

For  every  n  =  l,...,d  let  ELM  be  the  projection  map  from  Ed+1  to  En,  and  for 


every  x££  let  iAn(x»‘)  the  image  measure  of  t/)(x,*)  under  nn(-);  also  let 


X'la  =  (x,...,x)  . 

n  copies 


Then  applying  Fubini’s  Theorem  to  (3.29) 

p(k1k+d)(x)A)  >  J  fk(x,y)  4>{dy) 

A 


(3.30) 


where 


d— 1 


fk(x.y)  =  E  I  f(X,y*id-»+i)  exP 

n-l  n^^y) 


U,(X) 


Tk+d-i 

Now  by  the  Radon-Nikodym  Theorem  we  have 

p(k,k+d)(x^  _  j-p(k,k+d)^x^y.j  <f>(dy)  -f  p(k.k+d)(X)  A) 


4(x,dX)  .  (3.31) 


(3.32) 


where  </>(•)  and  p(k-k+d)(x  ,*)  are  mutually  singular.  It  follows  from  (3.30)  and 
(3.32)  that 


/  p(k,k+d)(x,y)  <£( dy)  >  /  fk(x,y)  $dy) 

A  A 


for  all  x£E  and  A £B,  and  so 

p<k'k+d)(x,y)  >  fk(x,y)  (3.33) 

for  all  x€£)  and  <£-a.e.  yGE,  and  consequently  there  ie  a  version  of  p!k,I(+dl(-,*) 
such  that  (3.33)  holds  for  all  x,y££. 

Fix  e  >  0  and  u,v££.  For  each  x,y£E  if  Vd(u,v)  <  oo  let 
Nn(x,y)  =  {X£nnMn(x,y)  :  Un(X)  <  Vd(u,v)  +  e} 
for  n  =  l,...,d,  and  set 


g(x»y)  =  £  I  f(x>y'ld-n+i)  4(x>dX) ; 

n-l  Nn(x,y) 


(3.34) 


if  Vd(u,v)  =  oc  set  g(x,y)  =  1.  Then  from  (3.31) 

Vd(u,v)  +  e 


fk(x»y)  >  s(x-y)  exP 

We  make  the  following 


■  k+d-l 


V  x,y£E 


(3.35) 


>  •.  v  /v  %  /.  %  %  \  V  \  \  *.  *.  •  •/  • 
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Claim  There  exists  a  neighborhood  B  of  (u,v)  in  SxE  such  that 

(  mf  g(*,y)  >  0  . 

Suppose  the  Claim  is  true.  Then  by  setting  Bu  v  =  B  and 

K(u,v)  =  inf  g(x,y) 

and  combining  (3.33)  and  (3.35)  we  obtain  the  Theorem.  It  remains  to  prove 
the  Claim. 

Proof  of  Claim  Assume  Vd(u,v)  <  oo.  From  (3.19)  there  exists  X£Ad(u,v) 
such  that  Ud(\)  <  Vd(u,v)  4-  e,  and  since  r(X)  >  0  there  exists  8  >  0  such  that 
r(X)  >  8.  AJso  from  (3.19)  we  must  have  u  ^  v,  which  implies  there  exists  an 
n€{l,...,d}  such  that  X£Mn(u,v).  It  is  clear  X  can  be  chosen  such  that  all  of 
the  self-transitions  in  X  which  occur  before  the  nth  transition  (which  is  not  a 
self-transition)  occur  consecutively.  We  consider  here  the  following  case  (the 
other  cases  are  similar): 

X  —  (u  =  U0  —  ***  =  Um_x  um  *  —  5*  Un_!  ^  un  =  Un+1  =  •••  =  v) 

where  1  <  m  <  n  <  d.  Using  (A4),  (A6)  and  Lemma  3.3  we  can  choose 
neighborhoods  Bu  of  u,  Bv  of  v,  and  Bk  of  uk  for  k  =  m,...,n— 1  with 
Bn_!  PI  By  —  0,  such  that  for  every  x€Bu  and  y6Bv  we  have 
Ud(o)  <  Vd(u,v)  +  e  and  r(o)  >  8  for  all  <7€{x}m  x  Bm  x  •••  x  Bn_j  x  {y}d_n+1. 
Let 

°x,y  =  (x}m  xBmx  —  X  Bn_!  V  x,y6S  , 

and  B  =  Bu  x  By.  Then  for  every  (x,y)£B  we  have  0X  yCNn(x,y)  and  hence 
from  (3.34) 

g(x.y)  >  /  8  0n(x,dX) 

>  5  <£(Bm) . 4>{ Bn) 

>  0 


by  (A3).  This  proves  the  Claim  and  hence  the  Theorem.  □ 


3.4.2  Visiting  of  Neighborhood  of  the  Set  of  Global  Minima  with 
Probability  One 

We  now  give  a  theorem  whichs  gives  conditions  such  that  annealing 
chain  {£k}  visits  a  given  neighborhood  of  S  infinitely  often  with  probability 
one.  Let  e  >  0  and 

S,  =  {x<E£  :  U(x)  <  inf  U(y)  +  e} 

ye£ 

To  avoid  trivialities  we  will  need  the  following  assumption: 

(P)  Every  iG£\S{  can  reach  some  j€S(. 

Let 

V*  =  sup  inf  V(x,y)  . 

x€E\S,  y€St 

Note  that  under  (Al)-(A4)  and  (A6)  (so  that  £\Sf  is  compact  and  by 
Proposition  3.4  V(v)  is  upper  semicontinuous)  (P)  holds  iff  V(  <  oo. 

Theorem  3.4  Assume  (Al)-(A7)  and  (P).  Let  {Tk}  be  monotone 
nonincreasing  and 


°o  Vt*  +  <5  ' 

E  exP - ~ -  =  °° 


(3.36) 


for  some  5  >  0.  Then  P{^kGS<  i.o.}  =  1  for  all  e  >  0. 

Proof  We  first  show  that  there  exists  dEBfsuch  that 

v;  -  ,1%.  'll,  Y*[x'y)  -  I  (3-37) 

For  every  xE£\S(  there  exists  a  d(x)E3!such  that 

■^1  vd(x)(x>y)  <  v(x>y)  +  7  <  V*  +  ~  . 

y€S<  y€S<  l  & 

But  by  Proposition  3.4  for  every  xE£\S(  Vd(x)(v)  is  an  upper  semicontinuous 

function  on  £x£  and  so  inf  V<(  x)(*,y)  is  an  upper  semicontinuous  function  on 

y€S« 

£,  and  consequently  there  exists  a  neighborhood  Bx  of  x  such  that 

in|  vd(x)  (u,y)  <  v<*  +  7  V  UEBX . 
y€S,  i 

Now  (Bxn(£\SJ  :  xE£\S,}  is  an  open  cover  of  compact  £\S(  (in  the  relative 
topology)  and  so  there  exists  x1,...,xnE£\S(  such  that 


v/.y.  >.y.;  •*././. 


vvv 


w  v»'jtvr#»v 


AS,  c  UA. 

fcn-l 

Let  d*  =  max  d(xn).  Now  it  is  easy  to  see  that  for  every  x£E 

n— 

inf  Vn(x,y)  <  inf  Vm(x,y)  V  n  >  m  . 
yes,  yes, 

Hence  for  every  x£E\S, 

inf  Vd-(x,y)  =  min  inf  VB(x,y)  <  V,*  + 
y6S<  n<d  y€S,  2 

and  (3.37)  follows  by  setting  d  =  d*. 

Next,  from  Theorem  3.3  for  every  u,v£E  there  exists  a  neighborhood  Bu 
of  (u,v)  in  ExE  and  a  positive  number  K(u,v)  >  0  such  that 

P(k,lt+d)(x.y)  >  K(u,v)  exp  -  -Vd^’V)  —  —  v  (x,y)GBu  v  . 

1k+d-l 


£k  =  £kd.  ^k  =  exp 


■  kd+d-l 


a(x,y)  =  Vd(x,y)  +  -  V  x,y€E  . 


(3.38) 


Then  {£k}  *s  a  Markov  chain  with  1-step  transition  functions  (Pk(v)}  whose 
(^-absolutely  continuous  components  have  densities  (Pk(v)}  which  satisfy 

Pk(x.y)  >  K(u,v)  0k  (u'y)  V  (x,y)€BU  T,  V  u,v£E  . 

Let 

a  =  sup  inf  a(x,y)  . 
ieiAS,  yes, 

By  (3.37)  and  (3.38)  a  <  V,  4-  5.  Hence  since  {Tk}  is  nonincreasing  the 
divergence  of  the  series  in  (3.36)  implies  that 

£  *k  -OO. 

k— 1 

Hence  we  may  apply  Theorem  3.1  to  {£k}  with  Y  =  S,  to  get 
P{£k€S(  i.o.}  =  1  and  so  P{£k€Sf  i.o.}  =  1.  □ 


>  /  >  A‘.V 


{ wvwv  w  ywvwy  "WV1 


CHAPTER  IV 

DIFFUSION  TYPE  ALGORITHMS 


4.1  Introduction  to  the  Langevin  Algorithm 

In  Chapter  2  we  discussed  the  annealing  algorithm  proposed  by 
Kirkpatrick  et.  al.  [19]  and  Cerny  [3]  for  combinatorial  optimization.  In 
Chapter  3  we  extended  the  annealing  for  optimization  on  general  spaces. 
Motivated  by  image  processing  problems  with  continuous  variables,  Geman 
and  independently  Grenander  [13]  have  recently  proposed  using  diffusions  for 
optimization  on  multidimensional  Euclidean  space.  In  this  Section  we 
describe  this  method.  Like  the  annealing  algorithm,  this  approach  to  global 
optimization  has  generated  alot  of  interest  and  there  already  exists  a 
significant  literature  on  the  subject. 

Let  U(*)  be  a  nonnegative  continuously  differentiable  function  on  If.  The 
goal  is  to  find  a  point  in  If  which  minimizes  or  nearly  minimizes  U(*).  Let 
T(*)  be  a  positive  Borel  function  on  [0,oo).  As  with  the  annealing  algorithm 
we  shall  refer  to  U(*)  as  the  energy  function  and  T(’)  as  the  temperature 
schedule.  Let  w(‘)  be  a  standard  r-dimensional  Wiener  process  and  let  x(*)  be 
a  solution  of  the  stochastic  differential  equation 

dx(t)  =  —  VU(x(t))dt  +  "\/2T(t)  dw(t)  ,  t  >  0  ,  (4.1) 

for  some  initial  condition  x(0)  =  x0  (by  a  solution  we  mean  that  x(’)  is  a 
separable  process  with  continuous  sample  paths  with  probability  one,  x(')  is 
nonanticipative  with  respect  to  w(*),  and  x(’)  satisfies  the  Ito  integral 
equation  corresponding  to  (4.1)).  For  a  fixed  temperature  T(t)  =  T  >  0,  (4.1) 
is  the  Langevin  equation,  proposed  by  Langevin  in  1908  to  describe  the 
motion  of  a  particle  "id  a  viscous  fluid.  Geman  and  Grenander  suggested  that 
(4.1)  could  be  used  to  minimize  U(*)  by  letting  T(t)— ►€).  Following  Gidas’  [  1 1  ] 
notation,  we  shall  call  the  algorithm  which  simulates  the  sample  paths  of  x(*) 
with  T(t) — *0  the  Langevin  algorithm. 

The  motivation  behind  the  Langevin  algorithm  is  similar  to  that  of  the 
annealing  algorithm.  Let  xT(*)  be  the  solution  of  (4.1)  with  T(t)  =  T,  a 
positive  constant,  and  let  PT(v,*)  be  its  (stationary)  transition  function,  i.e., 


A  AAA  AA 


v  >»»•/>: 


•  for  every  t  >  0  and  A£Br  PT(t,#,A)  is  a  Borel  function  on  Rr 

•  for  every  t  >  0  and  xGR1^  PT(t,x  ,*)  is  a  probability  measure  on 

(Kr,Br) 

•  PT(t,x,A)  =  f  PT(s,x,dy)  PT(t— s,y,A)  for  all  0  <  s  <  t,  x£Rr,  and 
A£Br 

•  P{xT(t)£A(xT(s)}  =  PT(xT(s),A)  w.p.l  for  all  0  <  s  <  t  and  A£Br 
Under  certain  conditions  (c.f.  [31]),  PT(*,v)  has  an  invariant  Gibbs  measure 

nT(-),  i.e., 

nT(A)  =  /  nT(dx)  PT(t,x,A)  V  t  >  0  ,  V  A£Br  , 

where 

J  exp(—  U(x)/T)  dx 

- - — -  V  A£Br  , 

/  exp(—  U(y)/T)  dy 

and  furthermore 

P{xT(t)£’}— ►nT(*)  weakly  as  t— *oc  .  (4.2) 

Now  for  suitable  U(*) 

nT(-) — ►n  (*)  weakly  as  T— (4.3) 

where  II*(*)  is  a  probability  measure  on  (Rf,Br)  with  support  in  the  set  S  of 
global  minima  of  U(-);  see  [17]  for  conditions  under  which  (4.3)  holds  and  a 
characterization  of  11  (■)  in  terms  of  the  Hessian  of  U(*).  In  view  of  (4.2)  and 
(4.3)  the  idea  behind  the  Langevin  algorithm  is  that  by  choosing  T  =  T(t)— >0 
slowly  enough  hopefully 

P{x(t)0}  ~  nTW(-)  (t  large) 

and  then  perhaps 

P{x(t)6*} — ►II  (•)  weakly  as  t — >cc  (4.4) 

and  consequently  x(t)  converges  to  S  in  probability. 

The  Langevin  and  the  annealing  algorithms  both  have  a  stochastic 
descent  behavior  whereby  “downhill”  moves  are  modified  probabilistically  by 
“uphill”  moves  with  fewer  and  fewer  uphill  moves  as  time  tends  to  infinity 
and  temperature  tends  to  zero.  However,  the  simulations  of  these  Monte 
Carlo  algorithms  are  quite  different.  To  simulate  sample  paths  of  x(*)  we 
might  discretize  (in  time)  the  Langevin  algorithm  as 
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xk+1  =  xk  “  VU(xk)e  +  V2T(keje  wk  ,  (4.5) 

where  {wk}  is  a  sequence  of  standard  Revalued  Gaussian  random  variables 
and  e  is  a  (positive)  discretization  interval,  and  simulate  sample  paths  of  {xk} 
by  generating  pseudorandom  Gaussian  variates.  VU(*)  may  be  computed 
from  an  analytical  formula  or  approximated  in  a  standard  fashion.  Compare 
this  simulation  with  that  of  the  annealing  algorithm  (see  Chapter  2). 

Geman  reports  some  encouraging  numerical  results  have  been  obtained 
by  Aluffi-Pentini  et.  al.  [32]  with  a  modified  Langevin  algorithm  which  uses  an 
interactive  temperature  schedule.  Tests  have  been  run  on  U(*)  defined  on  Rr 
with  r  =  Gidas  also  reports  a  numerical  experiment  with  a  single  U(-) 

defined  on  E  with  400  local  minima.  He  suggests  that  a  combination  of  the 
Langevin  algorithm  with  the  popular  multistart  technique  (c.f.  [29])  might 
improve  the  performance  obtained  by  using  either  approach  alone.  We 
remark  here  that  comparing  different  global  optimization  algorithms  is  in 
general  a  very  difficult  problem.  Rubenstein  [29]  discusses  some  analytical 
methods  for  comparing  different  algorithms.  Dixon  and  Szego  [5]  have 
attempted  to  define  a  standard  set  of  test  functions  which  might  be  used  to 
empirically  compare  different  algorithms.  It  is  not  clear  that  either  of  these 
methods  are  suitable  for  evaluating  the  performance  of  the  Langevin 
algorithm.  These  tools  it  seems  were  designed  to  compare  algorithms  which  in 
some  way  take  advantage  of  the  structure  of  smooth  functions  on  low 
dimensional  spaces.  We  regard  the  Langevin  algorithm  as  a  “universal” 
algorithm  which  may  be  used  on  functions  defined  on  high  dimensional  space 
whose  structure  is  essentially  unknown  or  cannot  be  simply  characterized.  It 
seems  that  the  best  test  for  the  Langevin  algorithm  is  the  particular  problem 
one  wishes  to  solve. 

We  shall  now  outline  those  convergence  results  for  the  Langevin 
algorithm  which  are  known  to  us.  We  refer  the  reader  to  the  specific  paper 
for  full  details. 

Geman  and  Hwang  [9]  were  the  first  to  obtain  a  convergence  result  for 
the  Langevin  algorithm.  They  consider  a  version  of  the  Langevin  algorithm 
restricted  to  a  compact  subset  of  E1^  (using  reflection  barriers).  They  show 
that  for  a  temperature  schedule  of  the  form 


T(t)» 


(t  large) 


that  if  c  is  no  smaller  than  the  difference  between  the  maximum  and 
minimum  values  of  l’(')  then  (4.4)  is  obtained. 


N  \  *. 


\  V  \  %  V 


a’aa  * 
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Gidas  [l  1  ]  has  obtained  necessary  and  sufficient  conditions  for  the 
convergence  of  the  Langevin  algorithm  in  all  of  E1,  using  partial  differential 
equation  methods.  He  shows  that  there  exists  a  constant  A  such  that  for 
temperature  schedules  T(t)|0,  (4.4)  holds  iff 


OC 

/  exp 

o 


T(t) 


dt  =  oc 


Furthermore,  the  constant  A  is  the  natural  continuous  analog  of  Hajek’s 
constant  (see  (2.10)).  Chiang  et.  al.  [4]  have  also  obtained  sufficient 
conditions  for  the  convergence  of  the  Langevin  algorithm  in  all  of  1^  using 
large  deviations  theory. 


Kushner  [21]  has  obtained  a  detailed  picture  of  the  asymptotic  behavior 
of  a  class  of  diffusions  related  to  the  Langevin  algorithm  and  certain  discrete¬ 
time  approximations  as  well.  Kushner  considers  (in  discrete-time)  an 
algorithm  of  the  form 

Xk+l  =  Xk  -f  akb(Xk,£k)  +  V2  ako(Xk)wk  (4.6) 


where  {£k}  is  a  sequence  of  bounded  random  variables  and 


ai  =  i^T  (k  large) ' 

In  the  special  case  where  b(*)  =  E{b(*,£k)}  =  —  VU(-)  and  of-)  =  I,  (4.6)  is  a 
stochastic  approximation  version  of  the  Langevin  algorithm  with  noisy 
measurements  of  VU(*).  We  shall  refer  to  the  Monte  Carlo  algorithm  which 
simulates  the  sample  paths  of  {Xk}  as  Kushner ’s  algorithm. 

We  remark  that  the  conditions  under  which  the  above  results  are 
obtained  typically  include 

(i)  U(*)  has  continuous  second-order  partial  derivatives 

(ii)  The  local  minima  of  U(-)  consist  of  a  finite  number  of  compact 
sets;  for  Gidas’  result  it  is  actually  required  that  the  local  minima 
be  isolated  and  nondegenerate. 

These  assumptions  are  stronger  than  these  assumed  in  Theorem  3.4,  where  it 
was  only  required  that  U(-)  be  continuous  on  a  compact  metric  space.  Of 
course  the  conclusion  of  Theorem  3.4  is  only  that  the  annealing  algorithm 
visits  a  given  neighborhood  of  S  infinitely  often  with  probability  1,  whereas 
the  above  results  show  convergence  of  the  Langevin  algorithm  to  S  in 
probability. 


In  this  Chapter  we  shall  examine  certain  issues  concerning  the  Langevin 
and  annealing  algorithms  which  seem  important  to  us  and  apparently  have 
not  been  considered  elsewhere.  We  proceed  as  follows.  We  have  seen  that 
the  motivation  behind  the  annealing  and  Langevin  algorithms  is  quite  similar. 
The  first  question  we  would  like  to  answer  is: 

•  what  more  can  be  said  about  the  relationship  between  the 
annealing  and  Langevin  algorithms? 

In  4.2  we  shall  show  that  an  annealing  chain  driven  by  white  Gaussian  noise 
converges  in  a  certain  sense  to  a  Langevin  diffusion.  Now  it  seems  clear  that 
the  annealing  algorithm  and  the  Langevin  algorithm  each  have  certain 
advantages.  The  Langevin  algorithm,  for  example,  looks  like  (for  large  time 
and  small  temperature)  a  gradient  descent  algorithm,  and  gradient  descent 
algorithms  and  their  higher  order  generalizations  such  as  Newton’s  algorithm, 
which  are  “local”  algorithms  in  the  sense  that  they  use  only  the  value  of  the 
objective  function  and  a  finite  number  of  derivatives  at  the  current  iterate  to 
obtain  the  next  iterate,  are  efficient  at  finding  local  minima.  The  annealing 
algorithm,  on  the  other  hand,  is  not  strictly  “local”  in  that  it  uses  the  value 
of  the  objective  function  in  some  set  containing  the  current  iterate  to  obtain 
the  next  iterate.  In  this  sense,  the  annealing  algorithm  might  be  called 
“semilocal”  or  even  “global”  depending  on  how  much  of  the  objective 
function  is  used.  Following  the  usual  thinking  behind  both  the  annealing  and 
Langevin  algorithms,  the  idea  is  to  make  large  fluctuations  initially  and  small 
descent-like  moves  eventually.  In  view  of  these  considerations,  the  second 
question  we  would  like  to  answer  is: 

•  is  there  a  natural  hybrid  algorithm  whose  initial  behavior  resembles 
the  annealing  algorithm  an  whose  large  time  behavior  is  similar  to 
the  Langevin  algorithm? 

In  4.3  we  propose  such  an  algorithm  based  on  the  results  of  4.2. 

4.2  Convergence  of  the  Annealing  Chain  to  a  Langevin  Diffusion 

In  this  Section  we  shall  examine  the  relationship  between  the  annealing 
and  Langevin  algorithms.  We  shall  show-  using  a  result  of  Kushner's  [22]  on 
the  weak  convergence  of  interpolated  Markov  chains  to  diffusions  that  a 
parameterized  family  of  annealing  chains  driven  by  white  Gaussian  noise 
interpolated  into  piecewise  constant  processes  converge  weakly  to  a  time- 
scaled  solution  of  the  Langevin  equation.  The  weak  convergence  here  is  in 
the  sense  that  the  probability  measures  induced  by  the  interpolated  chains  on 
the  path  space  of  functions  without  discontinuities  of  the  second  kind 


j*  w*  .*  y  v*  v*  ^  ^  ^  v 
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converge  weakly  to  the  probability  measure  induced  by  the  limit  diffusion. 
This  technique  is  the  same  one  used  to  justify  the  popular  diffusion 
approximation  method,  whereby  a  complicated  possibly  non-Markovian 
process  is  approximated  by  a  simpler  diffusion  process  (c.f.  [23^ ). 

Let  Dr[0,Tj  denote  the  space  of  Revalued  cadlag  functions  on  0,T  with 
0  <  T  <  oo,  i.e.,  functions  which  are  right-continuous  on  |0,T’,  have  left-hand 
limits  on  (0,T],  and  are  left  continuous  at  T.  The  following  elementary  results 
on  weak  convergence  of  probability  measures  may  be  found  in  [2  .  There  is  a 
metric  d-j-fv)  on  Dr[0,T]  with  respect  to  which  Drj0,T]  is  a  complete  separable 
metric  space,  and  if  f(')GDr[0,T]  and  {fa(‘)}  is  a  sequence  in  Dr[0,Tj  then  the 
convergence  of  fn(*)  to  f(*)  in  Dr[0.Tj  implies  convergence  at  all  points  of 
continuity  of  f(*)  (convergence  of  fn(-)  to  f(’)  in  Dr[0,T]  is  roughly  equivalent  to 
uniform  convergence  outside  of  any  neighborhood  of  the  discontinuity  points 
of  f(‘)).  Let  £(*),  {£<(*)  :  e  >  0}  be  processes  with  sample  paths  in  Dr  0,T  ,  or 
equivalently,  random  variables  which  take  values  in  Dr'0,T',  and  let 
/i(*),  {/i4(*)  :  £  >  0}  be  the  probability  measures  they  induce  on  the  Borel 
subsets  of  Dr[0,T].  We  shall  say  that  £,(•)  converges  weakly  to  £(•)  in  Dr;0,T] 
and  write  £,(*) — ►£(’)  weakly  (in  Dr[0,T])  if  /i,(*)  converges  weakly  to  /i(*)  as 
e— *Q,  i.e.,  if 

lim  /  f(x)  d/i,(x)  =  /  f(x)  d/u(x) 


for  all  bounded  continuous  f(*)  on  Dr[0,T].  Let  Dr[0,oc)  denote  the  set  of  Rr- 
valued  functions  on  [0,oc)  which  are  right-continuous  on  '0,0c)  and  have  left- 
hand  limits  on  (0,oc).  Let 


d(f.E)  -  E  7  d„(f,8) 

n-1  ^ 


V  f,gGDr[0,oc)  . 


d(*,*)  is  a  metric  on  Dr[0,oo)  with  respect  to  which  Dr  0,ce)  is  a  complete 
separable  metric  space,  and  we  can  define  the  weak  convergence  of  processes 
with  sample  paths  in  Dr[0,cc)  analogously  to  Dr'0,Tj  w'ith  T  finite. 

Suppose  £,(•) — +£{')  weakly  (in  Dr[0,Tj)  as  e — O  with  0  <  T  ^  x.  Then  it 
can  be  shown  that  the  set  of  points  t£'0,T]  such  that  //({£(t_)  *  w(t)})  >  0  is 
at  most  countable.  Let 


C  =  {te[0,T]  :  /i(K(t_)  *  sc(t)})  =0}  . 

Then  it  can  also  be  shown  that  for  any  points  t,,...,tkGC  the  multivariate 
distributions  of  {£,(ti),...,£,(tk)}  converge  to  the  multivariate  distributions  of 
{ ^(t-l ),... T^(tk) }  as  £ — ► 0 .  But  the  weak  convergence  of  £,(*)  to  £(•)  says  much 
more  than  this:  if  f(*)  is  a  continuous  functional  on  Dr  0,T  (or  just  ft- a.s. 


continuous)  then  f(ft(*)) — >-f(£('))  weakly  as  e — ►(). 

Let  Cr[0,T]  denote  the  space  of  ^-valued  continuous  functions  on  [0,T] 
with  0  <  T  <  oo.  If  we  equip  Cr[0,T]  with  the  uniform  topology  for  T  <  co 
and  with  the  topology  of  uniform  convergence  on  compacts  for  T  =  oc,  then 
Cr[0,T]  is  a  complete  separable  metric  space  and  we  can  define  weak 
convergence  of  processes  with  sample  paths  in  Cr[0,T].  Our  reason  for  using 
Dr[0,Tj  is  simply  that  we  shall  make  use  of  Kushner’s  result  on  the  weak 
convergence  of  Markov  chains  interpolated  into  Dr[0,T].  Kushner’s  stated 
reason  for  working  with  Dr[0,T]  as  opposed  to  Cr[0,T]  is  that  it  is  easier  to 
verify  tightness  (relative  compactness)  for  a  sequence  of  probability  measures 
on  the  Borel  subsets  of  Dr[0,T|.  If  the  limit  process  is  a  jump  diffusion  then  of 
course  it  would  be  necessary  to  work  with  Dr[0,T],  but  this  is  not  an  issue 
here  since  our  limit  processes  are  assumed  to  be  ordinary  (continuous  sample 
paths  with  probability  one)  diffusions. 

We  now  set  up  the  notation  necessary  to  state  Kushner’s  Theorem  on  the 
weak  convergence  of  interpolated  Markov  chains.  It  will  be  notationally 
convenient  in  the  sequel  to  assume  that  all  processes  are  defined  on  a  common 
probability  space  (ft,  F,  P)  and  we  shall  do  so  without  further  comment.  Let 
0  <  T  <  oo.  Let  F(v)  and  F€(v)»  f  >  0,  be  If-valued  Borel  functions  on 
tfx[0,T],  and  let  G(v)  and  G<(v),  e  >  0,  be  rxr  matrix-valued  Borel  functions 
on  Bfx[0,T].  For  each  e  >  0  let  {££}  be  a  Markov  chain  with  state-space  E1 
such  that 

E{«+,  -«l?k}-F,(«,k£)£, 

E{(«+,  -  «)8>(«+l  -  ffikk)  “  G ,(«,k0  Gte.kejf  , 

with  probability  one.  Interpolate  {££}  into  a  process  £,(•)  with  sample  paths 
in  Dr[0,T]  by 

£<(t)  =  £k  V  (k_1)e  <  t  <  ke  ,  Vk  =  1,..., 

Here  is  Kushner’s  Theorem  in  slightly  modified  form. 

Theorem  4.1  (Kushner  [22]).  Assume 

(Kl)  F(v),  G(v)  are  bounded  and  continuous 
(K2)  F,(v),  G,(v)  are  uniformly  bounded  for  small  e  >  0 


(K3)  E  S  f|F<(«,kt)— F(«,k£p  +  |G,(«,k£)-G(«,k<)P 


e  I  — ►  0 


as  e — *0 


(K4)  E  £  [  lfk+i— F((fk>ke)e|2 

k— 1  L 


as  e — >0  for  some  a  >  0. 

Let  v(*)  be  a  standard  r-dimensional  Wiener  process  and  assume  that 

d£(t)  =  F(£(t),t)dt  +  G(£(t),t)dv(t),  0  <  t  <  T  , 

has  a  unique  solution  £(*)  (in  the  sense  of  multivariate  distributions)  with 
initial  condition  £(0)  =  £0.  Assume  that 

— ►Co  weakly  as  e— +0  . 


Then 


£<(*)—♦£(*)  weakly  (in  Dr[0,T])  as  e— +0  . 


Consider  now  the  following  family  of  Markov  chains.  Let  U(’)  and  T(') 
be  defined  as  in  4.1.  For  each  e  >  0  let  {z£}  be  a  Markov  chain  with  state 
space  IF  and  1-step  transition  functions  (P£(v)}  given  byf 

Pk(x»A)  =  /  s£(x,y)  dN(x,eI)(y)  +  7k(x)  6(x, A)  (4.7) 

A 

for  all  xClF  and  A£Br,  where 


faee  Chapter  3  for  general  state  space  Markov  chain  notation 


r 


B 


sk(x»y)  = 


if  U(y)  >  U(x) 
if  U(y)  <  U(x)  , 


(4.8) 


7k(x)  =  1  -  /  s£(x,y)  dN(x,eI)  (y)  ,  (4.9) 

and  <5(x,*)  is  the  unit  measure  concentrated  at  x,  for  all  x,y£l.r.  Comparing 
(4.7)  and  (3.2)  it  is  seen  that  {z£}  is  infact  an  annealing  chain  of  the  type 
introduced  in  Chapter  3  with  state  space  the  measure  space  (£,5,0)  where 
£  —  If,  5  =  Br,  0(*)  is  Lebesque  measure,  and 

Q(x,A)  =  /  q(x,y)  0(dy)  =  N(x,el)  (A)  V  AQBr 

A 


(hence  the  annealing  chain  is  “driven”  by  white  Gaussian  noise), 
convenient  to  introduce  the  following  notation.  For  each  e  >  0  let 


s(x,y,t)  =  ' 

exp 

U(y)  -  U(x) 
T(t) 

1 

if  U(y)  >  U(x) 
if  U(y)  <  U(x)  , 


It  will  be 


7((x,t)  =  1  -  /  s(x,y,t)  dN(x,eI)  (y)  , 
for  all  x,y£jf  and  t  >  0,  and  let 

P <(x,A,t)  =  /  s(x,y,t)  dN(x,eI)  (y)  +  7e(x,t)  5(x,A) 

A 


for  all  x£If ,  A£Br,  and  t  >  0.  Then 

P((x,A,ke)  =  Pk(x,A)  V  xElf  ,  V  A£Br  . 
For  each  e  >  0,  x€If  and  t  >  0  let 


b«(x,t)  «  ~  /  (y  -  x)  P ((x,dy,t)  , 

a<(x,t)  =  “  /  (y  -  x)  <g)  (y  -  x)  P ,(x,dy,t)  , 


and  £7((x,t)  be  a  positive  square  root  of  a((x,t)  i.e. 

<7{(x,t)cr'(x,t)  =  a((x,t)  . 

Since  P((*,*,ke)  =  P£(v)  is  a  (regular  wide-sense)  conditional  distribution  for 
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zk+i  glven  zk> 

E{zk+i  ~  zk!zk}  =  bt(z^,ke)e 

E{(zk+i  ~  zk)  <8>  (zk+i  -  zk)lzk)  =  ^(zk,ke)a-'(z^,ke)e 

with  probability  one.  Interpolate  {z£}  into  z,(*)  with  sample  paths  in  Dr[0,T] 

by 


ze(t)  =  i{  V  (k-l)e  <  t  <  ke  ,  Vk  =  1,...,  —  . 

Here  is  our  convergence  theorem. 

Theorem  4.2  Assume 

(Al)  U(*)  is  continuously  differentiable,  VU(‘)  is  bounded  and  Lipshitz 
(A2)  T(’)  is  continuous 

Let  w(’)  be  a  standard  r-dimensional  Wiener  process,  and  let  z(’)  be  a  solution 
off 


dzW  =  -  -  dt  +  dw(t)  ,  0  <  t  <  T  , 

with  initial  condition  z(0)  =  z0.  Assume  that 

zi“*zo  weakly  as  e— >0  . 


(4.10) 


Then 


zt(*)— ►z(*)  weakly  (in  Dr[0,T])  as  e— >0  . 


Remark  Let  r(*)  be  a  solution  of 

f(t)  =  2T(7(t))  ,  0  <  t  <  T  , 

and  let 

z(t)  =  z(r(t))  ,  T(t)  =  T(r(t))  ,  0  <  t  <  T  . 

Clearly,  z(")  is  a  Markov  process  with  continuous  sample  paths  with 
probability  one.  Now  by  standard  calculations 


fit  is  well  known  that  (4.10)  has  a  (strongly)  unique  solution  under  (Al),  (A2) 


*.  ,v.v.v;v>.v  • -.v(vv >>■.«/ vv\-  .\-\-v ^  v"  .  .  N.  "' 
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E{z(t+h)  -  z(t) Jz ( t) }  =  -  VU(z(t))h  +  0(h3/2) 

*  t  +  h 

E{(z(t-fh)  —  z(t))0  (z(t+h)  —  z(t))(z(t)}  =  f  v2T(s)  dsT  +  0(h2) 

t 

as  h — >0,  uniformly  for  0  <  t  <  T,  with  probability  one.  Hence  by  a  Theorem 
of  Doob’s  [6,  p.  288]  there  exists  a  standard  r-dimensional  Wiener  process  w(*) 
such  that  z(*)  is  the  solution  of 

dz(t)  =  -  VU(z(t))dt  +  V2t(t)  dw(t)  ,  0  <  t  <  T  .  (4.11) 

Hence  the  interpolated  annealing  chain  zf(')  converges  weakly  to  z(’),  which  is 
infact  a  time-scaled  solution  of  the  Langevin  equation  (4.11). 

We  shall  need  several  lemmas  before  we  can  apply  Theorem  4.1  to  prove 
Theorem  4.2.  Let 

if  (VU(x),  y— x)  >  0 
if  (VU(x),  y-x)  <  0  (4‘12) 

for  all  x,y€KT  and  t  >0. 

Lemma  4.1  Assume  (Al),  (A2).  Then  there  exists  a  constant  K  such  that 
|s(x,y,t)  —  s(x,y,t)  (  <  K  (y— x  p  ,  V  x,yeRr  ,  V  0  <  t  <  T  . 


s(x,y,t)  = 


exp 

(VU(x),  y-x) 

T(t) 

1 

Proof  Let 

f(x,y)  =  U(y)  -  U(x)  -  (VU(x),  y-x)  V  x,yGRr  . 

By  the  Mean  Value  Theorem  and  (Al)  there  exists  a  constant  c  such  that 

|f(x,y)  |  <  c  [y— x  p  V  x,yGRr  , 

and  by  (A2)  there  exists  a  constant  K  such  that 

<  Kjy-xp  V  x,yer  ,  V  0  <  t  <  T  . 

T(tJ 

Suppose  U(y)  —  U(x)  >  0  and  (VU(x),  y— x)  <  0.  Then  |U(y)  —  U(x)|  <  |f(x,y)| 
and  since  |l  —  ex|  <  Jx|  for  x  <  0 


|s(x,y,t)  -  s(x,y,t)|  =  1  -  exp  - 


-  U  x 


lf(x,y)l 
"  T(t) 


<  K*  (y— x  |2  V  x.yGB1'  ,  V  0  <  t  <  T  . 


The  same  inequality  holds  if  U(y)  —  U(x)  <  0  and  (VU(x),  y— x)  >  0.  Suppose 
that  U(y)  —  U(x)  >  0  and  (VU(x),  y— x)  >  0.  Then 


I 

|s(x,y,t)  -  s(x,y,t)|  <  1  -  exp  - 


<  Iffx.y)  1 

~  T(t) 

<  K-Jy-xP  V  x.yGJf  ,  V0<t<T. 

The  Lemma  follows  by  combining  the  various  cases.  □ 

The  following  two  Lemmas  provide  the  crucial  estimates  of  the  local  drift 
bt(v)  and  local  covariance  ae(y)  of  z€(*).  The  simple  estimate 

/  bP  dN(0,d)  (y)  =  0(e“/2)  as  e— K) 

for  nGUJwill  be  used  frequently  in  the  sequel. 

Lemma  4.2  Assume  (Al),  (A2).  Then 


b{(x,t)  =  -  2^tyj  +  0(e1/2)  as  e—0  , 
uniformly  for  xGJf ,  0  <  t  <  T. 

Proof  By  Lemma  4.1  there  exists  a  constant  K  such  that 

|s(x,y,t)  —  s(x,y,t)|  <  K]y— x|2  V  x,yGl.r  ,  V  0  <  t  <  T  . 

Hence 


bf(x,t)  =  —  /  (y-x)  P<(x,dy,t) 

=  ~  J  (y-x)  s(x,y,t)  dN(x,cI)  (y) 

=  jf  (y-x)  s(x,y,t)  dN(x,eI)  (y) 

+  ~  f  (y-x)  s(x,y,t)  -  s(x,y,t)  dN(x,eI)  (y) 

=  j  /  (y-x)  s(x,y,t)  dN(x,cI)  (y)  +  0(e1/2)  as  £ — K)  , 
uniformly  for  xGlEf  and  0  <  t  <  T.  Substituting  for  s(*,v)  from  (4.12)  gives 

b«(x,t)  =  J  y  dN(0,I)  (y) 

€  (y,vu(x))<o 


+  I  yexp  ~ 


1/2  J 

6  (y.vu(x})>o 


-T(*"yJ-  e1/2  dN(0,I)  (y)  +  0(e1//2)  (4.13) 


as  e — K),  uniformly  for  x€Er  and  0  <  t  <  T.  Clearly, 

b«(x*t)  =  -  -  +  0(e1/2)  as  e— 0 

uniformly  on  {xGIf  :  VU(x)  =  0}x[0,T].  Hence  we  may  assume  that 
VU(x)  ^  0  for  all  xEfiL  Let 


aM  =  ~ 


,  /3(x,t)  = 


(4.14) 


for  all  xElf  and  t  >  0.  By  (Al),  (A2)  Qr(y)  and  /?(*,’)  are  bounded  on 
BTx[0,T].  Now  completing  the  square  in  the  second  integrand  in  (4.13)  gives 


[•It, 


b^x.t)  = 


1/F  /  y  dN(0,I)  (y) 

e  (y.vu(x))<o 


+  ~i/t  /  y  exp(a(xft)e)  dN  -  e1/2  ,1  (y)  +  0(e1/2) 

e  (y,vu(x))>o  1 11) 


-4n  J  y  dN(o,I)  (y)  + 


,1/2  J 

c  (y,vu(x))<o 


T7 1  I  y  dN(0,I)  (y) 

e  (y,vu(x))>i(x,t)f‘/2 


-  N(0,I)  {y  :  (y,VU(x))  >  /3(x,t)e1/2}  +  0(e1/2) 

-  -  -^y1  f(Vu(x),0(£‘/!)) 

+  -Jji  [6(VUM,0(£'/2))  -  g(VU(x),0)]  +  0(e‘/2)  , 

as  e— >0,  uniformly  for  x£BT  and  0  <  t  <  T,  where 

f(u,<5)  =  N(0,I)  {y  :  (y,u)  >  |u|5}  , 

g(u,5)  =  /  y  dN(0,I)  (y)  . 

(y,u)>|u|i 

To  proceed  further  we  need  to  estimate  f(v)  and  g(v)-  We  have 

f(u,<5)  =  N(0,I)  {y  :  (y,u)  >  |u|<5} 

=  N(0,1)  [<S,oo) 

=  1  _  r  - I - e-^2  df 

2  o  (27T)1/2  * 

=  —  +  0(5)  as  6 — K)  , 

2 
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uniformly  for  uElEf.  As  for  g(y),  we  make  the  following 


Claim 


g(u,5)  = 


e  *2  u 
(2t r)1/2  |u| 


for  all  5  >  0  and  uQRr\{0}. 


Suppose  the  Claim  is  true.  Then  combining  (4.16)-(4.18)  gives 

b  (x  t,  _  _  mi  +  °°(<)  -  i  VUM  . 

'  ’  1  2T(t)  +  {iTffl1  |VU(r.)|  (  * 

=  _  M  +  0(£‘/’)  as  e-*0  , 

2T(t)  v  ’ 


uniformly  for  x€If  and  0  <  t  <  T,  and  we  obtain  the  Theorem.  It  remains  to 
prove  the  Claim. 

Proof  of  Claim  Fix  u£lf\{0},  let 

and  extend  nt  to  an  orthonormal  basis  {n^...,^}  for  If.  Also  let  be 

the  standard  basis  for  If,  and  L(*)  be  the  (orthogonal)  linear  mapping  from  If 
into  If  such  that  L(ei)  =  n^  for  all  i  =  l,„.,r.  Applying  the  change  of  variable 
formula  and  using  the  fact  that  L(*)  is  an  isometry  and  the  adjoint 
L*(*)  =  L-1(*)  gives 


g(u,<5)  =  /  y  dN(0,I)  (y) 

(y,u)>|u|i 

=  /  E  (y^n;  ‘  dN(0,I)  (y) 

(y,tn)>#  i  — 1 

=  E  ni  /  (Lz>ni)  dN(0,I)  (z) 

i-l  (Lz.n,)^ 

-  E  Hi  /  (z,L*Di)  dN(0,I)  (z) 

i-l  (*,L’n,)>« 

=  E  n i  /  (*,es)  dN(0,I)  (z) 

i-l  (*,ei)>6 


= »,  /  e  dN(o,i)  (?) 


6 

oo 


1 


exp 


(-  «V2) 


d£ 


6  ^  A  VS> 0 


(2tt)1/2  |u| 


This  completes  the  proof  of  the  Claim  and  hence  the  Theorem.  □ 


s* 


Lemma  4.3  Assume  (Al),  (A2).  Then 

a,(x,t)  =  I  +  0(e1/2)  as  e — >0 
uniformly  for  xGif  and  0  <  t  <  T. 

Proof  Proceeding  as  in  the  proof  the  Lemma  4.2 
at(x,t)  =  ~  f  (y~x)<g>  (y-x)  P f(x,dy,t) 

-  ~  /  (y~x)0  (y-x)  s(x,y,t)  dN(x,eI)  (y)  +  O(e) 

=  /  y<8>y  dN(o,l)  (y) 

(y,VU(x))<0 

+  /  y^yexp  dN(0,I)(y)  +  0(e)  (4.19) 

(y,VU(x))>0  MW 

as  e — ► 0 ,  uniformly  for  xGK1  and  0  <  t  <  T.  Clearly, 

af(x,t)  =  I  +  0(e)  as  e— +0 

uniformly  on  {xGlf  :  VU(x)  =  0}x[0,T].  Hence  we  may  assume  that 
VU(x)  7*  0  for  all  xG®1.  Let  a(v)>  /?(*,•)  be  defined  as  in  (4.14).  Then 
completing  the  square  in  the  second  integrand  in  (4.19)  gives 

a<(x,t)  =  /  y<8>y  dN(0,I)  (y) 

(y.VU(*))<0 

' 

+  /  y®  y  exp  (a(x,t)e)  dN  - e1/2  ,  I  (y)  +  O(e) 

(y,VU(x))>0  MW 

=  /  y(g!y  dN(0,I)  (y)  +  /  y(g)y  dN(0,I)  (y)  +  0(e1/2) 

(y,VU(x))<0  (y,VU(x))>^(x,t)<1/2 


=  h(VU(x),0)  +  h(VU(x),0(e1/2))  +  0(e1/,z)  , 


(4.20) 


as  e — K),  uniformly  for  xGl^  and  0  <  t  <  T,  where 

h(u,<5)  =  /  y®y  dN(0,I)  (y)  . 

(y,u)>|u|i 

To  proceed  further  we  need  to  estimate  h(v).  We  make  the  following 


S'*  V  N"  il"  »  *  %'  V  %*  *  *  • 


S •*  j 


Claim 


uniformly  for  uGif. 


h(u,<5)  =  —  I  +  0(5)  as  5 — >0  ,  • 
2 


(4.21) 


Suppose  the  Claim  is  true.  Then  combining  (4.20)  and  (4.21)  gives 
a((x,t)  =  I  +  0(e1/2)  as  e — d)  , 

uniformly  for  xGR1^  and  0  <  t  <  T,  and  we  obtain  the  Theorem.  It  remains  to 
prove  the  Claim. 

Proof  of  Claim  Fix  uGH^VIO}  and  let  {n1,...,nr},  {e1(...,er},  and  L(*)  be  as  in 
the  proof  of  the  Claim  in  Lemma  4.2.  Then 

h(u,0)  -  /  y(g)y  dN(0,I)  (y) 

(y.u)>o 


-  J  E  (y.Qi)ni  ®  E  (y.nj)nj  dN(o,i)  (y) 

(y.ni)>0  (i-1  J  (j-1 

=  E  /  (L*,nj)  (Ls.nj)  dN(0,I)  (y) 

ij—l  (Lz,iii)>0 

=  V  n^nj  /  (z,L‘ni)  (z,L*nj)  dN(0,I)  (y) 

i  J— 1  (i,L*ni)>0 

=  E  ni<8>nj  J  (z>ei)  (z>ej)  dN(0,I)  (y) 

i.j-1  (*,ei)>0 

=  E  ni&ni  /  (z>ei)2  dN(0,I)  (y) 

i-1  (*,ei)>0 

°°  o  r 

=  n^nj  /  ?  dN(0,l)  (0  +  £  n,(g) ns  N(0,l)  (O.oo) 


=  —  E  ni®ni 

Z  i-l 


-ii 

2 


(4.22) 


Similarly, 


AV.VVVVVVVV.'/'-y.'/V V  \-  V.V.V.'.-  -■  V  V  V  •  \ 
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v  / 


.  I*<  1  k.4 


,  |1  LlljLIhl  l.'i.'-l.1  .<1  A  ■>!  ■*«  «*«  ■**  -'»  WjLj'UU  M  M  1.1 


h(u,0)  —  h(u,5)  = 


/  y0y  dN(0,I) 

0<(y,u)<|u  |<5 


—  E  ni0ni  /  (zfe,)2  dN(0,I)  (z) 

i— 1 

‘  , 

=  nj^n,  /  £2dN(0,l)  (0  +  £  ^0  n,  N(0,1)  [0,5) 

0  i-2 

*  , 

=  n,0n,  /  £2  - —  exp  (-  £2/2)d£ 

o  (2tt)^ 

r  <  i 

+  £  nj®^  /  —  --y2-  exp  (-  £2/2)  d£ 

i-2  0  (2tt) 

=  n10ni*o(53)  +  E  ni®ni  *  0(5) 

i-2 

=  0(5)  as  5 — >0  . 


(4.23) 


Combining  (4.22)  and  (4.23)  completes  the  proof  of  the  Claim  and  hence  the 
Theorem.  □ 

Lemma  4.4  Assume  (Al),  (A2).  Then 

<rf(x,t)  =  I  +  O'e1/2)  as  e — >0 
uniformly  for  xGB1  and  0  <  t  <  T. 

Proof  By  Lemma  4.3 

Mx,t)  =  I  +  0(e^2)  as  e— *0  (4-24) 

uniformly  for  xGBf  and  0  <  t  <  T.  Since  a((x,t)  is  self-adjoint,  there  exists  an 
orthogonal  matrix  L((x,t)  such  that 

ae(x,t)  =  L{(x,t)  A,(x,t)  L'(x,t)  (4.25) 


where 


A((x,t)  =  diag  (X()1(x,t)l...,\_r(x,t))  (4.26) 

and  the{X(  i(x,t)  :  i  =  l,...,r)  are  the  eigenvalues  of  a,(x,t),  i.e.,  the  solutions 
of  det(Xl  —  a{(x,t))  =  0.  Now  if  A  =  [a^]  is  a  real  rxr  matrix  then  det  A  may 
be  expressed  as 
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det  A  =  2  sgn(p)*a.ipi  —  arpr  (4.27) 

P  -  !p i}€-P 

where  P  is  the  set  of  permutations  of  {l,...,r}.  Setting  A  =  XI  —  a,(x,t)  and 
combining  (4.24)  and  (4.27)  gives 

det(\I  -  a,(x;t))  =  (X  -  l)r  +  (X  -  Ip1  Op/2)  +  -  +  0(er/2) 

and  so 

|X,.,(x,t)  -  If  -  0(max{|\.,(x,t)  -  if1  t'n  ,  er/2}) 
and  consequently 

\(i(x,t)  =  1  +  0(e1/2)  , 
and  since  (1  4-  5)1/2  =  14-  0(5)  as  5 — 4), 

X<11{2(x,t)  =  1  +  Op/2)  as  6 — K)  j  (4.28) 

uniformly  for  xEB*  and  0  <  t  <  T.  Let 

^(x,t)  =  L,(x,t)  A{1/2(x,t)  L((x,t)  . 

Then  by  (4.25)  a<(x,t)  =  <r((x,t)  o^x.t),  and  by  (4.26),  (4.28),  and  the  Schwartz 
inequality, 

<T<(x,t)  =  I  +  Op/2)  as  e — K)  , 
uniformly  for  xGlf  and  0  <  t  <  T,  as  required.  □ 

Proof  of  Theorem  4.2  We  shall  apply  Theorem  4.1  with  ££  =  z £, 
&(*)  =  *«(*)>  £(0  =  z(*)>  and 

F(x,t)  =  -  ,  F,(x,t)  =  b,(x,t) 

G(x,t)  =  I  ,  G((x,t)  =  <7((x,t)  . 

In  view  of  (Al),  (A2)  and  Lemmas  4.2  and  4.4,  (Kl)  and  (K2)  are  satisfied. 
Now  by  Lemmas  4.2  and  4.4  there  exists  a  constant  c  such  that  for  small 
enough  e  >  0 

Mx’t)  +  ^w 

k(x,t)  -  1 1  <  c  e1/2  , 
for  all  xGB.r  and  0  <  t  <  T.  Hence 


E  E  Mzk>ke)  +  1  +  k(zk>k£)  -  if  £ 

k— 1  zMKeJ 


[V<\ 

<  2  2c2£2  (e  sma11) 

k— 1 


<  2czTe  — ►  0  as  e— 


and  so  (K3)  is  satisfied. 

It  remains  to  check  (K4).  Since  P£(v)  is  a  conditional  distribution 
function  for  z£+1  given  z£  we  have  that  for  every  n£5I 

E<K*,  -  *kl"}  =  E{E{|zi+l  -  kifkk}} 

“E {/  b  -  2kP  pkK.  dy)} 

<E{/  H"|dN(0,fI)(y)} 

<  f"/2 

for  some  constant  cn.  Hence  using  the  uniform  boundedness  of  b((v)  for 
small  e 


M 

e  S  bk+1 

k-1 


M 

zk  -  b((zk»k£)£P  <  E  d  *  e2 

k-l 


(e  small) 


<  dT  •  e— *0  as  e—*0 


for  some  constant  d,  and  so  (K4)  is  satisfied  with  a  =  2.  The  Theorem  now 
follows  from  Theorem  4.1.  □ 
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4.3  Hybrid  Annealing/Langevin  Algorithm 

In  this  Section  we  shall  give  a  hybrid  annealing/Langevin  algorithm 
whose  initial  behavior  resembles  that  of  the  annealing  algorithm  and  whose 
large  time  behavior  is  similar  to  the  Langevin  algorithm.  The  development  of 
this  algorithm  will  be  guided  by  Kushner’s  algorithm  and  the  results  of  4.2  on 
the  relationship  between  the  annealing  algorithm  and  the  Langevin  algorithm. 
We  note  that  the  discussion  in  this  Section  is  heuristic  at  points  and  more 
work  need  to  be  done. 

We  shall  make  use  of  the  notation  introduced  in  4.1,  4.2.  We  shall 
assume  that 


T(t)  = 


(t  large) 


where  c  is  a  positive  constant. 

We  start  by  considering  Kushner’s  algorithm  (4.6)  with  b(x,£)  =  —  VU(x) 
and  <t(x)  =  I,  i.e., 

Xk+1  =  Xk  —  ak  VU(Xk)  +  \/2  akwk  (4.29) 

where 

“  TSiT  (k  large)  • 

Kushner  [21]  has  shown  (roughly)  that  if  c  is  large  enough  and  the  sample 
paths  {Xk}  are  bounded  with  probability  one  by  some  device,  then  Xk 
converges  to  S  in  probability. 

Now  consider  the  discretization  (4.5)  of  the  Langevin  algorithm  (4.1)  with 
discretization  interval  e,  i.e., 

xk+l  =  xk  -  e  VU(xk)  +  V2T(ke)e  wk  .  (4.30) 

Interpolate  {;<k}  into  x((*)  with  sample  paths  in  Dr[0,oc)  by 

x((t)  =  xk  V  (k-l)c  <  t  <  ke  ,  VkG®. 

An  application  of  Theorem  4.1  under  assumptions  (Al),  (A2)  of  Theorem  4.2 
shows  that 

x((*)  — ►  x(‘)  weakly  (in  Dr[0,oc))  as  e — *-ec  . 

Suppose  in  (4.30)  we  replace  the  fixed  discretization  interval  t  by  a f .  a:.  :  • 
accumulated  time  ke  by 
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k— l 

=  S  aa  , 

n-X 

and  define 

Xjj+i  =  Xk  —  ak  VlJ(Xk)  ■+■  "\/ 2T(tk)ak  wk  . 
By  L’hopital’s  rule  and  the  Fundamental  Theorem  of  Calculus 

T(tk)  = - 

log  £  an 
n-1 


as  k — >oo.  Hence  we  may  write 

Xk+i  —  Xk  —  ak  VU(Xk)  +  \^2  akwk  (4.31) 

where  ak  ~  ak.  In  view  of  (4.29)  and  (4.31)  it  seems  clear  that  we  may 
identify  {Xk}  and  {Xk}  as  essentially  the  same  algorithm,  and  so  we  can  view 
{Xk}  as  arising  from  a  discretization  {xk}  of  the  Langevin  algorithm  x(*)  with 
a  nonstationary  discretization  interval  e  =  ak,  at  least  for  k  large  enough. 
Note  that  the  weak  convergence  of  xc(*)  to  x(*)  in  Dr[0,oo)  as  t — >0  does  not 
imply  that  {Xk}  and  x(*)  (and  presumably  {Xk}  and  x(*))  have  “close” 
asymptotic  measures,  from  which  we  might  conclude  that  the  convergence  of 
Xk  to  S  in  probability  as  k— *oo  follows  from  the  convergence  of  x(t)  to  S  in 
probability  as  t— kx>  (c.f.  [23]  for  a  discussion  of  asymptotic  measures  and  the 
relationship  to  weak  convergence).  However,  the  weak  convergence  of  xe(*)  to 
x(‘)  in  Dr[0,oo)  as  e— »0  and  the  convergence  of  x(t)  to  S  in  probability  as 
t— *oo  does  provide  a  certain  intuitive  basis  for  the  convergence  of  Xk  to  S  as 
k — kx>  in  probability,  which  infact  Kushner  proves. 

Using  the  above  interpretation  of  Kushner’s  algorithm  (4.29)  as  a  certain 
discretization  of  the  Langevin  algorithm  (4.1)  we  now  proceed  to  construct  a 
hybrid  annealing/Langevin  algorithm.  For  each  e  >  0  define  an  If -valued 
discrete  parameter  process  {yk}  as  follows.  Let 


where  {mk}  is  a  sequence  of  {0,l}-valued  random  variables  such  that  mk  is 
conditionally  independent  of  y/,...,yk_|,  wi>...»wk-i>  and  mi,...,mk_i  given 

yk»  wk>  and 


V  V  V  V  V-V  V  V  V  V  V  V  V  V  *  *  •  .  .'*>s  ••  •'.  •  .'.'.'.v  - 


*  j.  * » *  *  *  • 


where  we  use  the  notation  [a]+  =  max{0,a}  for  aQEL  A  calculation  shows  that 
{yk}  is  a  Markov  chain  and 

p{yk+i€Ajy£  =  y }  =  /  s£(y,z)  dN(y,2T(ke)eI)  (z)  +  ^(z)  S( z,A)  (4.32) 

A 

for  all  y€KT  and  AGBr,  where  sk(v)  is  given  by  (4.8)  and  7k(*)  is  given  by  the 
r.h.s.  of  (4.9)  with  el  replaced  by  2T(ke)eI.  Comparing  (4.32)  and  (4.7)  we  see 
that  {yk}  like  {zk}  is  an  annealing  chain  driven  by  white  Gaussian  noise, 
except  that  the  noise  driving  {yk}  is  nonstationary  with  covariance  2T(ke)eI. 
Interpolate  {yk}  into  yt(')  with  sample  paths  in  Dr[0,oo)  by 

y<(t)  =  yk  V(k— l)e  <  t  <  ke  ,  V  keN. 

In  Theorem  4.2  we  gave  conditions  such  that 

zt(*)— »-z(*)  weakly  (in  Dr[0,T])  as  e— >0  ; 

minor  changes  in  the  proof  of  Theorem  4.2  show  that 

y<(*)— ►x(-)  weakly  (in  Dr[0,oo))  as  e—*0 

under  the  same  conditions. 

Now  define  an  Bf-valued  discrete  parameter  random  process  {Yk}  as 
follows.  Let 

Yk+1  =  Yk  +  akMkwk 

where  {Mk}  is  a  sequence  of  {0, 1 }- valued  random  variables  such  that  Mk  is 
conditionally  independent  of  w1,...,wk_i,  and  M1,...,Mk_1  given 

Yk,  Wjj,  and 

nfw  _  i  fv  _ 1  1  [U(y  +  V2  akw)  -  U(y)]+  1 


P{Mk  =  l|Yk  =  y  ,  wk  =  w}  =  exp  j- 


By  similar  reasoning  as  with  {Xk}  we  may  view  {Yk}  as  arising  from  a 
discretization  {yk}  of  the  Langevin  algorithm  x(*)  with  a  nonstationary 
discretization  interval  e  =  ak,  at  least  for  k  large  enough.  We  shall  call  the 
algorithm  which  simulates  the  sample  paths  of  {Yk}  the  hybrid 
annealing / Langevin  algorithm. 

We  shall  now  make  a  few  comments  concerning  the  convergence  of  the 
hybrid  annealing/Langevin  algorithm.  The  weak  convergence  of  y((*)  to  x(’) 
in  Dr[0,oo)  as  e— +0  and  the  convergence  of  x(t)  to  S  in  probability  as  t— *x> 


does  provide  an  intuitive  basis  for  the  convergence  of  Yk  to  S  in  probability  as 
k— »oo.  This  intuition  is  further  bolstered  by  the  convergence  of  Xk  to  S  in 
probability  as  k — >oo.  Unfortunately,  we  have  not  been  able  to  establish  the 
convergence  of  {Yk}.  One  approach  which  might  be  fruitful  is  to  try  to  adapt 
Kushner’s  proof  of  the  convergence  of  {Xk}  (we  have  not  tried  this).  Our  idea 
which  we  did  not  succeed  in  developing  was  to  try  to  obtain  the  asymptotic 
behavior  of  the  {Yk}  process  directly  from  the  asymptotics  of  the  related  x(') 
process.  This  is  similar  in  some  respects  to  the  associated  ODE  method  used 
to  analyze  stochastic  approximation  algorithms  (c.f.  [24]),  whereby  the 
asymptotics  of  the  stochastic  approximation  algorithm  are  obtained  from  the 
asymptotics  of  the  “limit’1  process  which  satisfies  an  ordinary  differential 
equation.  However  in  our  problem  the  “limit”  process  x(*)  satisfies  the 
stochastic  differential  equation  (4.1).  Without  going  into  details  it  now 
appears  to  us  that  the  nonstationarity  of  x(*)  makes  it  very  difficult  (if  not 
impossible)  to  extend  the  associated  ODE  method  to  prove  convergence  of 

{Yk}. 

It  is  interesting  to  compare  the  1-step  transition  probabilities  for  {Xk} 
and  {Yk}.  Let  iV(m,A)(’)  be  an  r-dimensional  Gaussian  density  with  mean  m 
and  positive  definite  covariance  A,  i.e. 


"(m>A)K)  “  A)‘/’ 

for  all  fEBT .  Then  we  may  write 


•  exp 


[-  (f-m,  A-'K-m))/2] 


P{Xk+16A|Xk  =  v}  =  {  IM  <15 

A 

P{Yk+1€A|Yk  =  r,}  =  /  g(V,m  +  l(v)  %,A) 

A 


for  all  776BT  and  AEBr,  where 


fM  =  N(r?+akVU(r7),2ak2I)  (0 


s{V,t)  =  exp  \  - 


|u(5)  -  uffll  1  .  , 


/%,2akJI)  (Q 


'iiv)  -  i  -  /  g(ij,5)  d5 


for  all  rj,  fEE*’.  In  Figure  4.1(a)  we  show  a  bimodal  U(’)  defined  on  B.  The 
points  £2i  £3  are  solutions  of  U(£)  =  U (rj)  for  a  fixed  77.  In  Figure  4.1(b)  we 
sketch  f(7?,*)*  In  Figure  4.1(c)  we  sketch  g(7?,*);  we  also  show  the  unweighted 
Gaussian  density  N(rj, 2ak)  (•)  and  the  atom  at  77  with  mass  ^(77).  These 


CHAPTER  V 
CONCLUSIONS 


5.1  Summary  of  Results 

We  summarize  the  results  of  this  thesis  as  follows. 

(i)  We  analyzed  the  rate  of  convergence  in  probability  of  the  annealing 
chain  for  a  special  case  of  an  energy  function  with  two  local  minima.  We 
obtained  convergence  rates  for  nonparametric  temperature  schedules 
(Theorem  2.8),  and  also  for  parametric  temperature  schedules  Tk  =  c/log  k 
for  c  >  A  where  A  is  Hajek’s  optimal  constant  (Corollary  2.2).  There  are 
two  factors  which  limit  the  rate  of  convergence  in  probability.  One  factor 
corresponds  to  the  rate  at  which  the  annealing  chain  makes  transitions  from 
one  local  minimum  to  the  other  and  back.  For  temperature  schedules 
Tk  «=  c/log  k  this  factor  dominates  whenever  c  >A  .  The  other  factor 
corresponds  to  the  rate  at  which  the  annealing  chain  makes  its  first  transition 
from  the  strictly  local  minimum  to  the  global  minimum.  For  temperature 
schedules  Tk  =  c/log  k  this  factor  is  only  important  when  c  =  A  .  We  gave 
explicit  expressions  for  the  characteristic  time  scales  associated  with  each  of 
the  rate  limiting  factors. 

(ii)  We  analyzed  the  sample  path  properties  of  the  annealing  chain.  We 
gave  conditions  such  that  the  annealing  chain  visits  the  set  S  of  globally 
minimum  energy  states  with  probability  one  (Theorem  2.9),  visits  S  with 
probability  strictly  less  than  one  (Theorem  2.10),  and  converges  to  S  with 
probability  one  (Theorem  2.11). 

(iii)  We  gave  a  modification  of  the  annealing  algorithm  so  as  to  allow  for 
noisy  measurements  of  the  energy  differences  which  are  used  in  selecting 
successive  states.  This  is  important  when  the  energy  differences  cannot  be 
measured  exactly  or  when  it  is  simply  too  costly  to  do  so.  We  focused  on  the 
case  when  at  the  kth  time  step  the  energy  difference  between  the  candidate 
and  current  states  is  measured  with  additive  Gaussian  noise  with  mean  0  and 
variance  <r*.  We  showed  that  if  =  o(Tk)  then  the  asymptotic  behavior  of 
the  modified  annealing  algorithm  is  essentially  the  same  as  that  of  the 
unmodified  annealing  algorithm  (Theorem  2.12,  Corollary  2.3). 


(iv)  We  extended  the  annealing  algorithm  for  optimization  on  general 
spaces.  We  generalized  our  result  on  the  finite  state  annealing  chain  visiting 
the  set  S  of  globally  minimum  energy  states  with  probability  one  (Theorem 
2.9)  to  the  general  state  annealing  chain  visiting  a  neighborhood  of  S  with 
probability  one  (Theorem  3.4),  essentially  under  the  conditions  that  the  state 
space  be  a  compact  metric  space  and  the  energy  function  be  continuous. 

(v)  Our  most  important  results  concern  the  relationship  between  the 
annealing  and  Langevin  algorithms.  We  showed  that  a  parametric  family  of 
annealing  chains  driven  by  white  Gaussian  noise  and  interpolated  into 
piecewise  constant  processes  converge  weakly  to  a  time-scaled  Langevin 
diffusion  (Theorem  4.2).  Although  both  the  annealing  chain  and  Langevin 
diffusion  at  a  fixed  temperature  have  a  Gibbs  invariant  measure,  the  weak 
convergence  seems  to  us  to  be  a  rather  surprising  result.  Motivated  by  this 
convergence  result,  we  proposed  a  hybrid  annealing/Langevin  algorithm, 
whose  small  time  behavior  resembles  that  of  the  annealing  algorithm  and 
whose  large  time  behavior  is  similar  to  the  Langevin  algorithm. 

6.2  Open  Questions 

We  list  here  some  questions  which  naturally  follow  from  our  work. 

(i)  Is  their  an  extension  of  Theorem  2.8  and  Corollary  2.2  on  the  rate  of 
convergence  in  probability  of  an  annealing  chain  with  an  energy  function  with 
two  local  minima  to  energy  functions  with  an  arbitrary  number  of  local 
minima?  Also,  in  Theorem  2.8  do  the  conditions  (2.54),  (2.55)  suggest  the 
kind  of  regularity  in  the  temperature  schedule  which  guarantees  fast 
convergence  (recall  that  only  (2.53)  is  required  for  convergence)?  Also,  in  view 
of  the  relationship  discussed  in  Chapter  4  between  the  annealing  and 
Langevin  algorithms,  is  it  possible  to  establish  rates  of  convergence  similar  to 
those  in  Theorem  2.8  and  Corollary  2.2  for  the  Langevin  algorithm  with  a 
smooth  energy  function  with  two  local  minima? 

(ii)  Does  the  general-state  annealing  chain  converge  in  probability  to  the 
set  of  globally  minimum  energy  states,  assuming  only  that  the  state  space  is  a 
compact  metric  space,  the  energy  function  is  continuous,  and  suitable 
conditions  on  the  temperature  schedule? 

(iii)  Finally  and  most  importantly,  does  the  hybrid  annealing/Langevin 
algorithm  converge,  and  does  it  indeed  improve  on  the  performance  of  the 
annealing  and  Langevin  algorithms? 
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